Preface



The contributions in this volume give an overview of state-of-the-art results 
presented at the Workshop on 3D Structure from Multiple Images of Large- 
scale Environments (SMILE). This workshop was held in conjunction with the 
Fifth European Conference on Computer Vision 1998 in Freiburg, Germany. 
SMILE was a joint effort of the European ACTS projects VANGUARD and 
PANORAMA and the Esprit project CUMULI, all of which are involved in 
the analysis and reconstruction of 3D scenes from image sequences. 

The potential for 3D reconstructions of scenes and objects is tremendous. Much 
of the work reported here is to be seen especially against the background of 
a convergence between computer vision and computer graphics, and of a shift 
from signal-based to content-based image analysis in telecommunications. Ac- 
cordingly, the requirements for 3D models and acquisition systems are also shift- 
ing. Visualization rather than mensuration is the primary issue. The perceptual 
quality of the models, the flexibility of the acquisition, and the cost of the system 
are three driving forces in the search for new methods. 

The last few years have seen important steps toward genuine flexibility. A 
case in point is the use of multiple images to generate 3D models, without an 
explicit knowledge of the relative position of the cameras or the camera settings. 
The same developments also hold good promise to make 3D acquisition cheaper 
and more widely available. 

The contributions in this volume focus on the latest developments in this 
and related areas. They demonstrate the feasibility of generating highly realistic 
3D models from natural, uncalibrated video sequences, of using 3D models for 
telecommunications, and of integrating real and virtual objects into a single 
environment. 

This volume is divided into five thematic sections and an appendix. First, an 
overview is given of the work performed in the three organizing projects with 
links to the related contributions. In his invited presentation, Richard Hartley 
then develops ideas to exploit the duality concept between points and cameras 
in scene reconstruction. 

Section 2 discusses various approaches to formalize basic multiview relations 
and to reliably estimate image correspondences. 

Section 3 exploits results given in the previous section. Different approaches 
for the estimation of 3D scene structure from image sequences are presented, 
with an emphasis on fully automatic algorithms. 

Section 4 deals with the use of constraints to improve 3D modeling. Two 
basic approaches are discussed: introducing explicit geometrical constraints, and 
controlled user interaction. 

Section 5 bridges the gap to real-world applications. The object models are
integrated in real and virtual environments by using novel Augmented Reality 
techniques. 

An appendix has been included to give the nonspecialist reader a comprehensive
and intuitive introduction to multiview relations. It explains the geometry
behind the image relationships and develops the underlying mathematical con- 
cepts. 

We hope that you will enjoy this volume. 



Cumuli, Panorama, and Vanguard Project Overview

R. Mohr et al.

Abstract. This overview summarizes the goals of the European projects
Cumuli, Panorama, and Vanguard and references the various contributions
in this volume. There are several overlaps between the projects, which
all revolve around the geometric analysis of scenes from multiple
images. All projects attempt to reconstruct the geometry and visual
appearance of complex 3D scenes that may be static or dynamic.

While Cumuli and Vanguard deal with images from uncalibrated cameras
and unrestricted camera positions for general scenes, Panorama
focusses on a highly calibrated setup used to capture 3D person models.
Cumuli and Vanguard developed techniques for handling multiview
relations, object tracking and camera calibration, image- and
geometry-based view synthesis, and 3D model generation. Interaction with
the modeled scene and mixing of virtual and real objects lead to
Virtual/Augmented Reality applications in Vanguard and Cumuli, while
the Panorama approach is tuned to fully automatic scene analysis for
visual communication and 3D telepresence. Visualisation aspects are
handled by Vanguard and Panorama with the development of autostereoscopic
displays.



1 The Cumuli Project 

Cumuli (“Computational Understanding of MULtiple Images”) is an Esprit
Long Term Research project focusing on multi-image geometry and its
applications to 3D industrial metrology. It can be seen as a follow-up to the
successful Esprit-BRA project Viva, which involved three of Cumuli’s academic
partners: Lund University (G. Sparr), INRIA Sophia-Antipolis (O. Faugeras) and
INRIA-IMAG in Grenoble (R. Mohr). On the industrial side, Cumuli’s partners
have expertise in image-based metrology: Innovativ Vision Image Systems in
Linköping are specialists in motion measurement from high-speed image
sequences, e.g. car crash testing; and Imetric, based in Courgenay (Switzerland),
are world leaders in high-precision industrial photogrammetry from still images.
The final partner, Fraunhofer IGD in Darmstadt and Munich, specializes in
visual modelling for augmented and virtual reality. D. Wang from the Leibniz
laboratory in Grenoble has also joined the academic team, bringing his expertise
in algebraic proof for geometry.

Reinhard Koch, Luc Van Gool (Eds.): SMILE ’98, LNCS 1506, pp. 1-^21 1998.
© Springer-Verlag Berlin Heidelberg 1998

Viva provided deep insights into the geometry of 3D perception with un- 
calibrated cameras. Cumuli builds on this, considering on the one hand auto- 
calibration, Euclidean structure and extensions to more general types of image 
features, and on the other the special problems introduced by image sequences. 
It also addresses a new basic research problem: how can we automate the
geometric reasoning (consistency, integration of diverse types of information, ...)
required to build large 3D models from image data?

1.1 Multi-camera Geometry, Discrete Images

The first part of Cumuli concentrates on the geometry of discrete sets of im- 
ages of static scenes. The goal is to recover 3D scene structure under different 
assumptions: 

— Unknown camera parameters and scenes: the basic structure of uncalibrated 
multi-camera geometry has been well studied in the past, so here we focus 
on a unification of the theory on one side, and practical numerical methods 
on the other; 

— Partial camera or scene knowledge: in many applications some prior
information is available (e.g. camera calibrations, the knowledge that points lie
on a plane or curve, ...). Here we work on using this knowledge to extend
the range or quality of the reconstruction, for example creating methods that
allow reconstruction from the minimum amount of image data;

— Non-point-like image features: to date, much of the scientific and techni- 
cal work has focused on point-like primitives. In many applications lines, 
curves, and surfaces are also important. The goal is to use the strong ge- 
ometric constraints that such primitives induce to relax the conditions for 
reconstruction, and to improve its accuracy. 

Some of our recent results improve our understanding of multi-camera
geometry. Corresponding efficient reconstruction procedures have also been
obtained.

Autocalibration is a potentially important means of moving towards
Euclidean 3D structure. The basic autocalibration constraint has been reformulated
in terms of the absolute dual quadric and the direction frame formalisms,
and this has led to much more stable numerical algorithms. Autocalibration
has also been considered under weaker conditions, for instance when the skew
is the only constant camera parameter. A complete characterization of the
camera motions for which autocalibration is necessarily degenerate has also
been given.
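
The absolute dual quadric formulation mentioned above rests on a standard identity: a metric camera P = K [R | t] projects the quadric Q* = diag(1, 1, 1, 0) onto the dual image of the absolute conic, K K^T, whatever the rotation and translation. The following is a minimal numerical check of that identity, not the algorithms developed in Cumuli; the intrinsics, rotation and translation are illustrative assumptions, and the helper names are hypothetical.

```python
# Sketch: verify P Q* P^T = K K^T for a metric camera P = K [R | t].
# All numeric values are illustrative assumptions, not project data.
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def metric_camera(K, R, t):
    """Return the 3x4 projection matrix P = K [R | t]."""
    Rt = [R[i] + [t[i]] for i in range(3)]
    return matmul(K, Rt)


def project_dual_quadric(P, Q):
    """Return the 3x3 dual conic P Q P^T."""
    return matmul(matmul(P, Q), transpose(P))


# Hypothetical intrinsics: focal length 800, principal point (320, 240), zero skew.
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]

# Rotation by 30 degrees about the vertical axis and an arbitrary translation.
c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t = [0.5, -0.2, 2.0]

# Absolute dual quadric in a metric frame.
Q_metric = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0]]

P = metric_camera(K, R, t)
omega_from_quadric = project_dual_quadric(P, Q_metric)
omega_from_K = matmul(K, transpose(K))
max_diff = max(abs(omega_from_quadric[i][j] - omega_from_K[i][j])
               for i in range(3) for j in range(3))
```

In autocalibration the direction is reversed: the cameras are known only up to a projective transformation, and the quadric is estimated from constraints on K K^T (for example zero skew) imposed across all views.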

Considering geometric features, one major extension is the fact that
silhouettes of general surfaces can be used to compute both relative positioning
(Euclidean case) and projective reconstruction. Line features have been used in a
fast linear algorithm for the affine camera case based on a ‘1D camera’ (vanishing
point to image line at infinity projection).

The demonstrator for this part of the project will be integrated by Imetric.
It will use the different types of features to initialize a photogrammetric bundle
adjustment, for 3D metrology in scenes like the one illustrated in Fig. 1. The goal
is to allow more flexible system initialization, minimizing the amount of prior
information and scene instrumentation required.




Fig. 1. A point scene of an industrial environment 



1.2 3D Perception from Continuous Image Sequences 

The second part of the project considers image sequences and non-rigid motion 
estimation. Although the same underlying theory applies to discrete and con- 
tinuous images, the practical implementations tend to be rather different. For 
one thing, the geometry inherited from the discrete case becomes degenerate
when the views are very close or the inter-image motions are aligned (see Fig. 2).
The appropriate incremental geometry must be developed and integrated with 
the handling of uncertainty, compensating the degeneracy of close views with 
the redundancy of many images. A key problem is to specialize the multi-image 
matching constraints to the continuous case. An important result is that
third-order continuous constraints are needed to reconstruct the scene and motion.
Tracking tools and several efficient reconstruction methods are also being
developed to deal with the large volume of image data. 

Innovativ Vision will integrate a demonstrator of the tools coming out of this 
part of the project. The aim is to reconstruct the non-rigid 3D motion of points 
on a car or a test dummy during a car crash test, with as little prior calibration
as possible. Fig. 2 shows some typical data provided by a high speed camera.
Some camera vibration occurs, so static points in the scene will be tracked to 
provide a reference frame. 




Fig. 2. Image of a car crash sequence 



1.3 Algebraic Geometric Reasoning from Image Data 

The final part of Cumuli addresses a new area: the use of automated algebraic 
and geometric reasoning tools in visual reconstruction. The geometric models in 
computer vision are more than just coordinates: they are complex networks of 
incidence relations, constraints, etc. To build such models from weak or uncertain 
data, we need methods that make efficient use of any known constraints on the 
environment, to reduce the uncertainty and improve the consistency and quality 
of the generated models. Often, many such constraints are available. They are 
often used informally when building models by hand, but it is currently difficult 
to automate this process. In particular, we are studying the integration of digital 
site maps with vision data, as these are an important source of constraints in 
urban modelling (see The Use of Reality Models for Augmented Reality Applica- 
tions). 

Another main focus is the use of automated geometric reasoning techniques 
for computer vision applications. However, although the need to manipulate
abstract geometric knowledge (the constraints) strongly motivates their use, they
must be adapted to the particular requirements of computer vision. One
illustration of this approach is the automatic derivation of minimal
parametrizations of constrained 3D geometric models (see Imposing Euclidean
Constraints during the Self-Calibration Process). This is a crucial step towards
enforcing the constraints, but given the intrinsic uncertainty and the potential
complexity of the 3D models involved in computer vision (many primitives,
complicated constraints), to actually impose the constraints the algebraic
techniques need to be coupled to efficient numerical methods (see Euclidean and
Affine Structure/Motion for Uncalibrated Cameras from Affine Shape and
Subsidiary Information).



2 The Panorama Project 

This section gives an overview of the ACTS Panorama project and a more
detailed look at the 3-D reconstruction approach developed in
Panorama. The Panorama consortium consists of 14 European partners from
universities, research institutes and industry. The objective of the Panorama
project is to enhance the visual information exchange in telecommunications
with 3-D telepresence. The main challenges of Panorama are:

— Multiview camera and autostereoscopic display which allows for movement 
of the observer (multiviewpoint capability) 

— Realtime imaging system using special purpose hardware for analysis, vector 
coding and interpolation of intermediate views 

— Imaging system based on offline image analysis using 3D model objects and 
state-of-the-art 3D graphic computers 

— Application studies in field trials 

2.1 Goals of Panorama 

The ultimate goal for future telecommunication is highly effective interper- 
sonal information exchange. The effectiveness of telecommunication is greatly 
enhanced by 3-D telepresence. In this concept it is crucial that visual infor- 
mation is presented in such a way that the viewer is under the impression of 
actually being physically close to the party with whom the communication takes 
place. Existing systems realise 3-D telepresence by stereoscopic imaging and dis- 
play technologies. A natural 3-D impression is achieved if the camera positions 
correspond to the observer’s eye positions, and if the observer is stationary.

The objective of the Panorama project is to enhance the visual information 
exchange in telecommunication by alleviating two major drawbacks of existing
3-D telepresence systems. In the first place, an autostereoscopic display is realised
that spatially separates the left and right view images according to the observer’s
eye positions, instead of the current techniques which require wearing polarised or
active LCD glasses. Secondly, the communication system allows for movements
of the observer while providing a 3-D view, which makes it possible to look
around objects.

In order to look around objects, intermediate images are synthesised at the
receiver side appropriate to the observer’s head position. The synthesis of
intermediate views uses the captured trinocular image sequences and 3-D scene
information that is obtained by means of image analysis. Two different image
synthesis approaches are considered in Panorama. A high-quality but complex
off-line system based on 3-D reconstruction of dynamic scenes is realised that
uses explicit 3-D models. With this approach, synthesis of intermediate images is
carried out on-line using existing 3-D computer graphics hardware. On the other
hand, a real-time system is developed, composed of a disparity estimator; a
disparity field, video and audio codec for transmission over the Dutch and German
National Hosts; and disparity-compensated image synthesis of intermediate views.
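
As an illustration of what disparity-compensated synthesis of an intermediate view involves, the sketch below forward-warps a single rectified scanline by a fraction of its disparity. It is a simplified stand-in under stated assumptions, not the Panorama real-time system: the function name, the sign convention (disparity = column in the left image minus column in the right image) and the naive hole filling are all choices made for this example.

```python
# Sketch of disparity-compensated view interpolation on one rectified scanline.
# Assumption: disparity_row[x] = x_left - x_right for the point seen at column x
# of the left image. Occlusions and holes are handled naively.


def synthesise_scanline(left_row, disparity_row, alpha):
    """Forward-warp a left scanline towards the right view.

    alpha = 0.0 reproduces the left view; alpha = 1.0 approximates the right
    view. Pixels are written in order of increasing disparity so that nearer
    points (larger disparity) overwrite farther ones.
    """
    width = len(left_row)
    out = [None] * width
    order = sorted(range(width), key=lambda x: disparity_row[x])
    for x in order:
        target = int(round(x - alpha * disparity_row[x]))
        if 0 <= target < width:
            out[target] = left_row[x]
    # Naive hole filling: nearest valid pixel to the left, otherwise to the right.
    for x in range(width):
        if out[x] is None:
            candidates = [out[i] for i in range(x - 1, -1, -1) if out[i] is not None]
            if not candidates:
                candidates = [out[i] for i in range(x + 1, width) if out[i] is not None]
            out[x] = candidates[0] if candidates else 0
    return out


row = [10, 20, 30, 40, 50, 60]
constant_disparity = [2] * 6
same_view = synthesise_scanline(row, constant_disparity, 0.0)
half_way = synthesise_scanline(row, constant_disparity, 0.5)
```

With a constant disparity of 2 and alpha = 0.5, every pixel moves one column to the left and the uncovered last column is filled from its neighbour. A practical system would also blend contributions from the other cameras and treat occluded regions explicitly.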

The Panorama project demonstrates its achievements through application
studies which are realised as demonstrators, prototype subsystems or services. Two
demonstrators for video communication systems are currently being developed, one
for each of the investigated image synthesis approaches. Since other 3-D imaging
applications can also benefit greatly from the developed technologies, three
additional application studies are performed in Panorama in the fields of medicine,
automation in 3-D modelling of industrial environments, and 3-D programme
production.

2.2 3-D Reconstruction Approach in PANORAMA 

One of the two image synthesis approaches developed in Panorama for the 
generation of intermediate views is based on 3-D reconstruction and 3-D com- 
puter graphics. This approach uses a calibrated trinocular camera setup which is 
arranged around the autostereoscopic screen showing the other communication
party (Fig. 3).




Fig. 3. Trinocular camera setup for video communication application. Labels in
the figure: observer, autostereoscopic screen, camera, visualised 3-D model of
the communication partner.



The camera triple is calibrated once after its installation by determining the 
relative extrinsic and intrinsic camera parameters. An accurate multi-camera
calibration approach has been developed in Panorama, which is based on the
Tsai calibration. It exploits geometric constraints of the trinocular camera
setup in order to manage calibration with a relatively simple planar calibration
pattern.

The developed 3-D reconstruction approach for dynamic scenes represents 
the scene using explicit 3-D models, which are defined by 3-D shape, surface 
colour and 3-D motion parameters. This kind of representation is known from
computer graphics. In 3-D reconstruction, additional data like observation
points and a data memory for information from preceding time instances
are required. Thus, Panorama developed a new 3-D scene representation that
supports 3-D reconstruction and is still compatible with computer graphics. This
scene representation is currently implemented in more than 100 C++ classes.

The 3-D reconstruction can be subdivided into two phases: the initialisation 
phase, where new objects occurring in the scene are analysed, and an update 
phase, where temporal changes of already reconstructed objects are analysed 
and the quality of 3-D models is successively improved. 

In the initialisation phase, a set of depth maps and 3-D edges are estimated
from the input image triples. The depth maps are back-projected into 3-D
space, resulting in a cloud of 3-D points. These points are interpolated together
with the 3-D edges into a coherent surface model. This model is further
approximated using a triangular mesh that represents the shape of the objects. The
texture of the objects is represented by mapping parts of the image triple onto
each triangle.
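
The back-projection step described above can be sketched for a single camera as follows. This is a minimal illustration assuming an ideal pinhole model without lens distortion; the function name and the intrinsic parameters (fx, fy, cx, cy) are hypothetical and do not reproduce the Tsai-based calibration model used in Panorama.

```python
# Sketch: back-project a depth map into a cloud of 3-D points in the camera
# frame, assuming an ideal pinhole camera (no lens distortion).


def back_project(depth_map, fx, fy, cx, cy):
    """Return a list of (X, Y, Z) points for all valid depth samples.

    depth_map[v][u] is the depth Z along the optical axis at pixel (u, v).
    Entries that are None or not positive are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with one missing sample; intrinsics are illustrative.
cloud = back_project([[2.0, None],
                      [4.0, 2.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Points from the three cameras would then be transformed into a common frame with the extrinsic parameters before surface interpolation and meshing.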

The update phase starts by analysing the motion of the objects, assuming
that most of the changes result from motion of rigid objects. In the case of
persons, the assumption of rigidity requires a subdivision of the 3-D models into
articulated components. This subdivision is based on an evaluation of local 3-D
motion information which is estimated in a robust manner for each triangle.
After compensating the motion of the object components, the shape of the 3-D
models is updated. This update considers flexible deformations inside the models
and evaluates the changes at object borders. Finally the texture of the 3-D
models is updated, compensating the changes of the objects’ surface colour. The
resulting 3-D models are stored in order to analyse the next image triple and
are transmitted to the receiver (see Improving Block-based Disparity Estimation
by Considering the non-uniform Distribution of the Estimation Error,
Multi-Camera Acquisitions for High-Accuracy 3D Reconstruction, and Integration of
Multiple Range Maps through Consistency Processing).



2.3 Telepresence Visualization 

At the receiver, two views of the scene are visualised on an autostereoscopic
display according to the observer’s head position. Two techniques for visualising
the 3-D models were developed. On the one hand, interfaces to common computer
graphics file formats like VRML1, VRML2 and OpenInventor were built,
which enable visualisation with commercially available viewers. Since these
file formats provide only limited representations of temporal changes, a new
viewer for direct visualisation of the Panorama 3-D scene representation
and a file format based on OpenGL were additionally developed. With this viewer,
shape and texture updates can also be easily visualised.

A distributed development of a common 3-D reconstruction software is
performed in Panorama. The software is integrated and tested at one site to
guarantee the interfacing and functionality. The overall approach is tested using
image sequences of typical video communication scenes. The tests yielded
realistic-looking synthesised images of the other communication party for
reasonable observer movements. An example of such a synthesised image is given in
Fig. 4 (left). Due to the fixed camera positions that show the other communication
party from the front side, the 3-D models are not closed at their back. Therefore, 
the observer should not move his head outside the area that is covered by the 
real camera triple. 




Fig. 4. (left) Image synthesised from automatically generated 3-D model, (right) 
Seamless integration of an automatically generated 3-D model into a virtual 
environment 



Fig. 4 (right) shows that the automatically generated 3-D models can be
seamlessly integrated into virtual environments. This allows an integration of 3-D
telecommunication into 3-D applications with communication requirements,
e.g. 3-D teleshopping or cooperative CAD. Beyond that, the seamless integration
of automatically generated 3-D models and virtual environments will allow the
design of comfortable and intuitive teleconference systems in which participants
are brought together in a 3-D virtual conference room and given the impression
of being physically close to each other.

The project itself and a number of recent results are also documented online 
at 

http://www.tnt.uni-hannover.de/project/eu/panorama/ 



3 The Vanguard Project

Vanguard stands for Visualisation Across Networks using Graphics and Un- 
calibrated Acquisition of Real Data. The expertise of the consortium is in the 
fields of uncalibrated tracking and 3D reconstruction (University of Oxford, 
K.U. Leuven), image-based view synthesis and mosaicing (University of Jeru- 
salem), augmented reality systems (IGD Darmstadt) and 3D displays (SHARP 
Laboratories Europe).

The primary goal of this project is automatic 3D model building from video 
sequences, and the use of these models for the rendering of scenes in telepresence 
applications. The scenes may contain both real and synthetic objects. This goal 
entails extracting both object geometry and surface descriptions (reflectance) at 
a level suitable for high quality graphical rendering. 

3.1 Approach 

Two key points underlie our approach: 

— The camera motion is unknown and unconstrained, 

— The camera internal parameters (e.g. focal length) are unknown. 

Here, the goal is to arrive at 3D models from multiple images taken without 
any prior calibration of the internal or external camera parameters. 

These points are essential for flexible and general purpose model acquisition 
because camera motion and calibration are usually not recorded (think of ex- 
tracting 3D models of buildings from old newsreel footage, or from an unknown 
image sequence on the net). 

The 3D geometrical models, together with appropriate surface information 
facilitate: 

— Graphical rendering of novel views, and sequences of views, of the scenes. 

— Integration of synthetic and real models and sequences. 

The basic Vanguard modules and the interaction between vision and graphics
are sketched in Fig. 5. The main modules are:

Computer Vision. This module handles all computer vision tasks such as tracking
of features, self-calibration, and surface reconstruction. It forms the backbone
of the Vanguard technology (see the contributions Metric 3D Surface
Reconstruction from Uncalibrated Image Sequences, Automatic 3D Model
Construction for Turn-Table Sequences, and Matching and Reconstruction from Widely
Separated Views). Another important direction here is the development of
image-based view synthesis methods and of panoramic image representations [25].

3D Geometry. The geometrical models as produced by the vision tasks need
post-processing like mesh retriangulation and reduction, compact
representation, model fusion and editing (see Fitting Geometrical Deformable Models to
Registered Range Images).

Fig. 5. Schematic overview of Vanguard modules. Labels in the figure:
Interactive 3D graphics; Tracking and Display Technologies; Distributed computing.



Surface properties. In addition to the geometric properties, Vanguard aims at
extracting surface albedo, reflectance and texture properties. This information is
used to control the rendering and to further increase the realism of the generated
models.

3.2 Applications 

All modules as mentioned above are tuned towards a set of applications that will 
use the extracted real-world models and enhance them with AR/VR techniques. 
The project has focussed on three major application areas: 

3D surface modeling from an image sequence. The goal of this very general
application is to develop strategies for easy and efficient geometric modeling. Based
on images of the scene alone, highly realistic metric models of specific objects
or scenes (e.g. buildings, landscapes) are generated that can be used for
visualisation and demonstration purposes. As a test case for this approach, a
reconstruction of parts of the archaeological excavation site in Sagalassos, Turkey, was
performed. An example of 3D modeling shows the reconstruction of a corner of
the Roman bath at the Sagalassos site, based on five uncalibrated views (Fig. 6).

Fig. 6. Two images of the Roman bath sequence, reconstructed shape and textured
model

Stereo Visualisations of Objects and Scenes. Given a monocular (single camera)
sequence, the objective is to generate the image sequences that would have been
seen had a stereo-rig moved along the same path. The stereo sequence will be used
as an image source to drive 3D displays (e.g. shutter glasses on a workstation, and
Sharp’s autostereoscopic displays). By creating views that are “in-between”
the discrete set of images, an active “look-around” system can be built: a
pseudo-holographic display.

Collaborative Scene Visualisation. The objective of this application is to walk
around objects with a video camera and then extract a 3D model. The objects 
will then be graphically rendered together with possible arrangements of previ- 
ously modeled (e.g. CAD) objects and surroundings. Remote sites will be able to 
jointly change the configurations of real and virtual objects. The application will 
serve as a trial using ISDN or ATM (see Applying Augmented Reality Techniques 
in the Field of Interactive Collaborative Design). 

These applications represent a significant integration of state-of-the-art com- 
puter vision and computer graphics techniques. Many other application areas 
could benefit from these capabilities: non-invasive surgery; heritage
preservation; archaeologists recording layouts and discovered artefacts at an excavation
(a video sequence at each stage providing the basis for subsequent virtual
reality walkthroughs); accident investigators recording a scene before it is cleared;
architecture visualization; collaborative design; teleshopping, etc.

Walkthrough, look-around and model acquisition also have applications in 
the games industry, e.g. to create more realistic backgrounds than simple
graphically rendered ones, to immerse the player more fully (by changing the scene
when his/her viewing perspective changes), and to provide virtual objects. Fi- 
nally, although not explored here, the work opens up the prospect of intelligent 
keying - the integration of two video sequences - based on the geometries of 
the viewed scenes; it also allows virtual “tele-porting” of objects, since a cloned 
model can be cut at a distant location using the geometry and surface description 
provided by a video of the original. 



Abstract. It has been known since the work of Carlsson [2] and
Weinshall that there is a dualization principle that allows one to
interchange the role of points being viewed by several cameras and the camera
centres themselves. In principle this implies the possibility of dualizing 
projective reconstruction algorithms to obtain new algorithms. In this 
paper, this theme is developed at a theoretical and algorithmic level. 
The nature of the duality mapping is explored and its application to 
reconstruction ambiguity is discussed. An explicit method for dualizing 
any projective reconstruction algorithm is given. At the practical imple- 
mentation level, however, it is shown that there are difficulties which 
have so far defeated successful application of this dualization method to 
produce working algorithms. 



1 Introduction 

The theory and practice of projective and metric reconstruction from
uncalibrated and semi-calibrated views has reached such a level of maturity in
recent years that excellent results may now be achieved. Papers presented at this
workshop and reported in this volume show the high quality of reconstruction
that is now possible.

In particular, it would appear that many of the problems of reconstruction 
have now reached a level where one may claim that they are solved. Such prob- 
lems include 

1. Computation of the multifocal tensors, particularly the fundamental matrix
and trifocal tensors (the quadrifocal tensor having not received so much
attention).

2. Extraction of the camera matrices from these tensors, and subsequent pro- 
jective reconstruction from two and three views. 

Other significant successes have been achieved, though there may be more to 
learn about these problems. 

1. Application of bundle adjustment to solve more general reconstruction prob- 
lems. 

2. Metric (Euclidean) reconstruction given minimal assumptions on the camera
matrices.

3. Automatic detection of correspondences in image sequences, and elimination
of outliers and false matches using the multifocal tensor relationships.
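
For the two-view case, eliminating false matches with a multifocal relationship amounts to testing each correspondence against the epipolar constraint x2^T F x1 = 0. The sketch below scores matches with the Sampson distance, a common first-order approximation of geometric error; the choice of this error measure, the threshold and the function names are assumptions made for the example and are not taken from the text.

```python
# Sketch: reject false matches using the fundamental matrix F.
# Convention assumed here: x2^T F x1 = 0 for a correct match x1 <-> x2.


def sampson_distance(F, x1, x2):
    """First-order geometric error of the match x1 <-> x2 with respect to F.

    x1 and x2 are (x, y) image coordinates in the first and second view.
    """
    p1 = (x1[0], x1[1], 1.0)
    p2 = (x2[0], x2[1], 1.0)
    Fp1 = [sum(F[i][k] * p1[k] for k in range(3)) for i in range(3)]
    Ftp2 = [sum(F[k][i] * p2[k] for k in range(3)) for i in range(3)]
    residual = sum(p2[i] * Fp1[i] for i in range(3))
    denom = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    return residual ** 2 / denom if denom > 0 else float("inf")


def filter_matches(F, matches, threshold):
    """Keep the matches whose Sampson distance does not exceed the threshold."""
    return [m for m in matches if sampson_distance(F, m[0], m[1]) <= threshold]


# Illustrative F for a camera translating along its x axis: epipolar lines are
# horizontal, so a correct match keeps the same y coordinate in both views.
F_translation = [[0.0, 0.0, 0.0],
                 [0.0, 0.0, -1.0],
                 [0.0, 1.0, 0.0]]
candidate_matches = [((10.0, 20.0), (15.0, 20.0)),   # consistent
                     ((30.0, 40.0), (28.0, 40.5)),   # slightly noisy
                     ((50.0, 60.0), (55.0, 75.0))]   # false match
inliers = filter_matches(F_translation, candidate_matches, threshold=1.0)
```

In practice F is not known in advance, so a test of this kind is usually embedded in a robust estimation loop that alternates between fitting F to a sample of matches and counting the matches that agree with it.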

In other areas the last word has clearly not been written. Notably, there is not 
any single satisfactory algorithm for projective reconstruction from several views. 
Many methods have been tried: iterative methods, methods based on tacking
together reconstructions from small numbers of views, or factorization-based
algorithms which need arbitrary guesses at depth.

This paper discusses a technique that, although known, seems not to have 
received as much attention as may be warranted. The method, based on a
dualization principle expounded by Carlsson and also Weinshall, can in
principle transform the problem of projective reconstruction from long image 
sequences into the problem of projective reconstruction from small numbers of 
views, for which (as claimed above) the reconstruction problem is essentially 
solved. It is shown that although this duality theoretically gives rise to the de- 
sired multiple-view algorithms, in reality there are practical difficulties. In this 
paper, the problem of how to obtain working algorithms from this method is 
not solved. The purpose is to highlight the fascinating properties of the duality 
method, here called Carlsson duality, with the hope of awaking enough interest 
to lead to a practical implementation of these methods. 

Before we proceed to discuss duality, I claim the privilege of giving an opin- 
ion. At this point of maturity, the understanding of the underlying geometrical 
properties of multi-view vision and the implementation of high-quality geomet- 
rical algorithms have outstripped the less mathematically structured tasks of 
correspondence matching and 3D model building that are essential to build- 
ing a good system (despite the excellent results achieved and reported at the 
workshop). In short, we seem to be able to obtain small robust sets of image 
correspondences and reconstruct these points in 3-space. But how does one find 
sufficiently many correspondences to build a complete model, and anyway, how 
does one build a complete 3D model, that is, fill in the gaps between the points? 
We can still not do satisfactory automatic reconstruction from complex outdoor 
scenes (for instance a forest scene) or even indoor scenes such as a room with 
a jumble of furniture and equipment (such as my office). However, leaving for 
another day a consideration of these harder problems, we now turn to the main 
technical subject of this paper. 

2 Carlsson Duality 

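Let $\mathbf{E}_1 = (1, 0, 0, 0)^\top$, $\mathbf{E}_2 = (0, 1, 0, 0)^\top$, $\mathbf{E}_3 = (0, 0, 1, 0)^\top$ and $\mathbf{E}_4 = (0, 0, 0, 1)^\top$ 
form part of a projective basis for $\mathcal{P}^3$. Similarly, let $\mathbf{e}_1 = (1, 0, 0)^\top$, $\mathbf{e}_2 = (0, 1, 0)^\top$, 
$\mathbf{e}_3 = (0, 0, 1)^\top$, $\mathbf{e}_4 = (1, 1, 1)^\top$ be a projective basis for the projective image plane $\mathcal{P}^2$. 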

Now, consider a camera with matrix $P$. We assume that the camera centre $\mathbf{C}$ 
does not sit on any of the axial planes, that is $\mathbf{C} = (X, Y, Z, T)^\top$ and none of the 
four coordinates is zero. In this case, no three of the points $P\mathbf{E}_i$ for $i = 1, \ldots, 4$ are 
collinear in the image. Consequently, one may apply a projective transformation 
$H$ to the image so that $\mathbf{e}_i = HP\mathbf{E}_i$. We assume that this has been done, and 
henceforth denote $HP$ simply by $P$. Since $P\mathbf{E}_i = \mathbf{e}_i$, one computes that the form 
of the matrix $P$ is 



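$$P = \begin{bmatrix} \alpha^{-1} & 0 & 0 & -\delta^{-1} \\ 0 & \beta^{-1} & 0 & -\delta^{-1} \\ 0 & 0 & \gamma^{-1} & -\delta^{-1} \end{bmatrix} \qquad (1)$$

Further, the camera centre is $\mathbf{C} = (\alpha, \beta, \gamma, \delta)^\top$, as one verifies by solving $P\mathbf{C} = 0$. 
If $\mathbf{C} = (\alpha, \beta, \gamma, \delta)^\top$ is any point in $\mathcal{P}^3$, then the matrix in (1) will be denoted by 
$P_{\mathbf{C}}$. 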
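
The form of the matrix in (1) can be checked numerically. The following is a minimal sketch using exact rational arithmetic; the function names (`reduced_camera`, `project`, `proportional`) are illustrative choices and do not come from the paper. It builds the matrix for a sample centre and confirms that the centre is annihilated and that each reference vertex is sent to the corresponding image basis point up to scale.

```python
from fractions import Fraction

def reduced_camera(C):
    """Matrix (1) for a camera centre C = (alpha, beta, gamma, delta)."""
    a, b, c, d = (Fraction(v) for v in C)
    if 0 in (a, b, c, d):
        raise ValueError("C must not lie on a face of the reference tetrahedron")
    return [[1 / a, 0, 0, -1 / d],
            [0, 1 / b, 0, -1 / d],
            [0, 0, 1 / c, -1 / d]]

def project(P, X):
    """Matrix-vector product P X in homogeneous coordinates."""
    return [sum(p * x for p, x in zip(row, X)) for row in P]

def proportional(u, v):
    """True if two homogeneous vectors agree up to scale."""
    n = len(u)
    return all(u[i] * v[j] == u[j] * v[i] for i in range(n) for j in range(n))

C = (2, 3, 5, 7)                      # arbitrary centre with no zero coordinate
P = reduced_camera(C)
E = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
e = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]

assert project(P, C) == [0, 0, 0]     # the centre is the null vector of P
assert all(proportional(project(P, Ei), ei) for Ei, ei in zip(E, e))
```

Rational arithmetic is used only so that the equalities hold exactly; with floating-point data one would compare up to a tolerance instead.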

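Now, for any point $\mathbf{X} = (X, Y, Z, T)^\top$ one verifies that 

$$P_{\mathbf{C}}\mathbf{X} = \begin{pmatrix} \alpha^{-1}X - \delta^{-1}T \\ \beta^{-1}Y - \delta^{-1}T \\ \gamma^{-1}Z - \delta^{-1}T \end{pmatrix}. \qquad (2)$$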



This observation leads to the following definition. 

Definition 1. The mapping of $\mathcal{P}^3$ to itself given by 

$$(X, Y, Z, T)^\top \mapsto (YZT, ZTX, TXY, XYZ)^\top$$

will be called the Carlsson map, and will be denoted by $\Gamma$. We denote the image 
of a point $\mathbf{X}$ under $\Gamma$ by $\bar{\mathbf{X}}$. The image of an object under $\Gamma$ is sometimes referred 
to as the dual object, for reasons that will be seen later. 

The Carlsson map is an example of a Cremona transformation. For more 
information on Cremona transformations, the reader is referred to Semple and 
Kneebone (HH). 

Note. If none of the coordinates of $\mathbf{X}$ is zero then we may divide $\bar{\mathbf{X}}$ by $XYZT$. 
Then $\Gamma$ is equivalent to $(X, Y, Z, T)^\top \mapsto (X^{-1}, Y^{-1}, Z^{-1}, T^{-1})^\top$. This is the form 
of the mapping that we will usually use. In the case where one of the coordinates 
of $\mathbf{X}$ is zero, then the mapping will be interpreted as in the definition. Note that 
any point $(0, Y, Z, T)^\top$ is mapped to the point $(1, 0, 0, 0)^\top$ by $\Gamma$, provided none 
of the other coordinates is zero. Thus, the mapping is not one-to-one. 

If two of the coordinates of $\mathbf{X}$ are zero, then $\bar{\mathbf{X}} = (0, 0, 0, 0)^\top$, which is an 
undefined point. Thus, $\Gamma$ is not defined at all points. In fact, there is no way to 
extend $\Gamma$ continuously to such points. Note that the points for which the mapping 
is undefined consist of the lines joining two of the points $\mathbf{E}_i$. We will call the four 
points $\mathbf{E}_i$ the vertices of the reference tetrahedron. The lines joining two vertices 
are the edges of the tetrahedron, and the planes defined by three vertices are 
the faces of the reference tetrahedron. As remarked, $\Gamma$ is undefined on the edges 
of the reference tetrahedron. As for the faces of the reference tetrahedron, these 
are the points with a zero coordinate. Consequently (as shown above), each face 
is mapped by $\Gamma$ to a single point, namely the opposite vertex of the reference 
tetrahedron. 
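
The behaviour of the Carlsson map on general points, faces and edges can be illustrated with a short sketch. The helper names below are illustrative and not taken from the paper; the sample coordinates are arbitrary.

```python
from fractions import Fraction

def carlsson(X):
    """Carlsson map in the polynomial form of Definition 1."""
    x, y, z, t = X
    return (y * z * t, z * t * x, t * x * y, x * y * z)

def carlsson_inverse_form(X):
    """Equivalent form (1/x, 1/y, 1/z, 1/t); needs all coordinates non-zero."""
    return tuple(1 / Fraction(v) for v in X)

def proportional(u, v):
    """True if two homogeneous vectors agree up to scale."""
    n = len(u)
    return all(u[i] * v[j] == u[j] * v[i] for i in range(n) for j in range(n))

general = (2, 3, 5, 7)
# For a general point the two forms agree up to scale.
assert proportional(carlsson(general), carlsson_inverse_form(general))
# Applying the map twice to a general point returns the point, up to scale.
assert proportional(carlsson(carlsson(general)), general)

# A point on the face X = 0 collapses to the opposite vertex E1.
face_point = carlsson((0, 3, 5, 7))
assert any(face_point) and proportional(face_point, (1, 0, 0, 0))

# A point on an edge (two zero coordinates) has no defined image.
assert carlsson((0, 0, 5, 7)) == (0, 0, 0, 0)
```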

The major importance of the Carlsson map derives from the following 
formula, which is easily derived from (2). 

$$P_{\mathbf{C}}\mathbf{X} = P_{\bar{\mathbf{X}}}\bar{\mathbf{C}} \qquad (3)$$

Thus, $\Gamma$ interchanges the roles of object points and camera centres: $\mathbf{C}$ 
acting on $\mathbf{X}$ gives the same result as $\bar{\mathbf{X}}$ acting on $\bar{\mathbf{C}}$. The consequences of this 
result will be investigated soon. However, first we will investigate the way in 
which $\Gamma$ acts on other geometric objects. 
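
The duality formula (3) can be confirmed on sample data. The sketch below uses the inverse form of the Carlsson map, for which the two sides agree exactly rather than only up to scale; all names are illustrative and the numerical values are arbitrary.

```python
from fractions import Fraction

def bar(X):
    """Carlsson map in inverse form; requires non-zero coordinates."""
    return tuple(1 / Fraction(v) for v in X)

def reduced_camera(C):
    """Matrix (1) for centre C: diagonal entries 1/C_k, last column -1/C_4."""
    a, b, c, d = bar(C)
    return [[a, 0, 0, -d], [0, b, 0, -d], [0, 0, c, -d]]

def project(P, X):
    """Matrix-vector product P X."""
    return [sum(p * x for p, x in zip(row, X)) for row in P]

C = (2, 3, 5, 7)       # camera centre
X = (1, -4, 6, 9)      # object point

lhs = project(reduced_camera(C), X)             # P_C X
rhs = project(reduced_camera(bar(X)), bar(C))   # P_Xbar Cbar
assert lhs == rhs
```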

Theorem (2.0.1). The Carlsson map $\Gamma$ acts in the following manner: 

1. It maps a line passing through two general points $\mathbf{X}_0$ and $\mathbf{X}_1$ to the twisted 
cubic (m passing through $\bar{\mathbf{X}}_0$, $\bar{\mathbf{X}}_1$ and the four reference vertices $\mathbf{E}_1, \ldots, \mathbf{E}_4$. 

2. It maps a line passing through any of the points $\mathbf{E}_i$ to a line passing through 
the same $\mathbf{E}_i$. We exclude the lines lying on the face of the reference 
tetrahedron, since such lines will be mapped to a single point. 

3. It maps a quadric $Q$ passing through the four points $\mathbf{E}_i$, $i = 1, \ldots, 4$ to a 
quadric surface (denoted $\bar{Q}$) passing through the same four points. If $Q$ is a 
ruled quadric, then so is $\bar{Q}$. If $Q$ is degenerate then so is $\bar{Q}$. 

Proof. Part 1. A line has parametric equation $(X_0 + a\theta, Y_0 + b\theta, Z_0 + c\theta, T_0 + d\theta)^\top$, 
and a point on this line is taken by the Carlsson map to the point 

$$\big((Y_0 + b\theta)(Z_0 + c\theta)(T_0 + d\theta), \ldots, (X_0 + a\theta)(Y_0 + b\theta)(Z_0 + c\theta)\big)^\top .$$

Thus, the entries of the vector are cubic functions of $\theta$, and the curve is a 
twisted cubic. Now, setting $\theta = -X_0/a$, the term $(X_0 + a\theta)$ vanishes, and the 
corresponding dual point is $\big((Y_0 + b\theta)(Z_0 + c\theta)(T_0 + d\theta), 0, 0, 0\big)^\top \approx (1, 0, 0, 0)^\top$. 
The first entry is the only one that does not contain $(X_0 + a\theta)$, and hence the only 
one that does not vanish. This shows that the reference vertex $\mathbf{E}_1 = (1, 0, 0, 0)^\top$ 
is on the twisted cubic. By similar arguments, the other points $\mathbf{E}_2, \ldots, \mathbf{E}_4$ lie on 
the twisted cubic also. Note that a twisted cubic is defined by 6 points, and this 
twisted cubic is defined by the given 6 points $\mathbf{E}_i$, $\bar{\mathbf{X}}_0$, $\bar{\mathbf{X}}_1$ that lie on it, where $\mathbf{X}_0$ 
and $\mathbf{X}_1$ are any two points defining the line. 

Part 2. We prove this for lines passing through the point $\mathbf{E}_1 = (1, 0, 0, 0)^\top$. 
An analogous proof holds for the other points $\mathbf{E}_i$. Choose another point $\mathbf{X} = 
(X, Y, Z, T)^\top$ on the line, such that $\mathbf{X}$ does not lie on any face of the reference 
tetrahedron. Thus $\mathbf{X}$ has no zero coordinate. Points on a line passing through 
$(1, 0, 0, 0)^\top$ and $\mathbf{X} = (X, Y, Z, T)^\top$ are all of the form $(X, Y, Z, T)^\top + k(1, 0, 0, 0)^\top = 
(\alpha, Y, Z, T)^\top$ for varying values of $\alpha = X + k$. These points are mapped by the 
transformation to $(\alpha^{-1}, Y^{-1}, Z^{-1}, T^{-1})^\top$. This represents a line passing through 
the two points $(1, 0, 0, 0)^\top$ and $\bar{\mathbf{X}} = (X^{-1}, Y^{-1}, Z^{-1}, T^{-1})^\top$. 

Part 3. Since the quadric $Q$ passes through all the points $\mathbf{E}_i$, the diagonal 
entries of $Q$ must all be zero. This means that there are no terms involving 
a squared coordinate (such as $X^2$) in the equation for the quadric. Hence the 
equation for the quadric contains only mixed terms (such as $XY$, $YZ$ or $XT$). 
Therefore, a point $\mathbf{X} = (X, Y, Z, T)^\top$ lies on the quadric $Q$ if and only if 
$aXY + bXZ + cXT + dYZ + eYT + fZT = 0$. Dividing this equation by $XYZT$, we obtain 
$aZ^{-1}T^{-1} + bY^{-1}T^{-1} + cY^{-1}Z^{-1} + dX^{-1}T^{-1} + eX^{-1}Z^{-1} + fX^{-1}Y^{-1} = 0$. Since 
$\bar{\mathbf{X}} = (X^{-1}, Y^{-1}, Z^{-1}, T^{-1})^\top$, this is a quadratic equation in the entries of $\bar{\mathbf{X}}$. 
Thus $\Gamma$ maps quadric to quadric. Specifically, suppose $Q$ is represented by the 
matrix 

$$Q = \begin{bmatrix} 0 & a & b & c \\ a & 0 & d & e \\ b & d & 0 & f \\ c & e & f & 0 \end{bmatrix} \quad \text{then} \quad \bar{Q} = \begin{bmatrix} 0 & f & e & d \\ f & 0 & c & b \\ e & c & 0 & a \\ d & b & a & 0 \end{bmatrix}$$

and $\mathbf{X}^\top Q \mathbf{X} = 0$ implies $\bar{\mathbf{X}}^\top \bar{Q} \bar{\mathbf{X}} = 0$. If $Q$ is ruled, then the quadric $\bar{Q}$ is 
also a ruled quadric, since the generators of $Q$ passing through the points $\mathbf{E}_i$ map to 
straight lines lying on $\bar{Q}$. One may further verify that $\det Q = \det \bar{Q}$, which implies 
that if $Q$ is a non-degenerate quadric (that is $\det Q \neq 0$), then so is $\bar{Q}$. In this 
non-degenerate case, if $Q$ is a hyperboloid of one sheet, then $\det Q > 0$, from which it 
follows that $\det \bar{Q} > 0$. Thus $\bar{Q}$ is also a hyperboloid of one sheet. □ 
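
Parts 1 and 3 of the proof lend themselves to a quick numerical check. The sketch below is an illustration under stated assumptions, not part of the paper: the line, the quadric coefficients and all function names are arbitrary choices. It confirms that the image of a line meets every reference vertex, that a point on a quadric through the reference vertices maps to a point on the dual quadric, and that the two determinants coincide.

```python
from fractions import Fraction

def carlsson(X):
    """Carlsson map in the polynomial form of Definition 1."""
    x, y, z, t = X
    return (y * z * t, z * t * x, t * x * y, x * y * z)

def proportional(u, v):
    """True if two homogeneous vectors agree up to scale."""
    n = len(u)
    return all(u[i] * v[j] == u[j] * v[i] for i in range(n) for j in range(n))

def det(M):
    """Determinant by Laplace expansion along the first row (adequate for 4x4)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def quadric(a, b, c, d, e, f):
    """Matrix Q of a quadric through the four reference vertices."""
    return [[0, a, b, c], [a, 0, d, e], [b, d, 0, f], [c, e, f, 0]]

def dual_quadric(a, b, c, d, e, f):
    """Matrix of the dual quadric as given in the proof of Part 3."""
    return [[0, f, e, d], [f, 0, c, b], [e, c, 0, a], [d, b, a, 0]]

def quad_form(Q, X):
    """Value of the quadratic form X^T Q X."""
    return sum(Q[i][j] * X[i] * X[j] for i in range(4) for j in range(4))

# Part 1: the image of a line passes through each reference vertex E_k.
X0, D = (1, 2, 3, 4), (1, 1, 2, 3)           # base point and direction (arbitrary)
for k in range(4):
    theta = Fraction(-X0[k], D[k])            # parameter at which coordinate k vanishes
    point = tuple(x + theta * d for x, d in zip(X0, D))
    E_k = tuple(int(i == k) for i in range(4))
    image = carlsson(point)
    assert any(image) and proportional(image, E_k)

# Part 3: a point on Q maps to a point on the dual quadric.
coeffs = (1, 2, 3, 4, 5, 6)
a, b, c, d, e, f = coeffs
X, Y, Z = 1, 1, 1
T = Fraction(-(a * X * Y + b * X * Z + d * Y * Z), c * X + e * Y + f * Z)
on_Q = (X, Y, Z, T)                           # T is chosen so that the point lies on Q
assert quad_form(quadric(*coeffs), on_Q) == 0
assert quad_form(dual_quadric(*coeffs), carlsson(on_Q)) == 0
assert det(quadric(*coeffs)) == det(dual_quadric(*coeffs))
```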

We wish to interpret duality equation (3) in a coordinate-free manner. The 
matrix $P_{\mathbf{C}}$ has by definition the form given in (1), and maps $\mathbf{E}_i$ to $\mathbf{e}_i$ for 
$i = 1, \ldots, 4$. The image $P_{\mathbf{C}}\mathbf{X}$ may be thought of as a representation of 
the projection of $\mathbf{X}$ relative to the projective basis $\mathbf{e}_i$ in the image. 
Alternatively, $P_{\mathbf{C}}\mathbf{X}$ represents the projective equivalence class of the set of the five rays 
$\mathbf{C}\mathbf{E}_1, \ldots, \mathbf{C}\mathbf{E}_4, \mathbf{C}\mathbf{X}$. Thus $P_{\mathbf{C}}\mathbf{X} = P_{\mathbf{C}'}\mathbf{X}'$ if and only if the set of rays from $\mathbf{C}$ to 
$\mathbf{X}$ and the four vertices of the reference tetrahedron is projectively equivalent to 
the set of rays from $\mathbf{C}'$ to $\mathbf{X}'$ and the four reference vertices. 

The Duality Principle. 

There is nothing special about the four points $\mathbf{E}_1, \ldots, \mathbf{E}_4$ used as vertices of 
the reference tetrahedron, other than the fact that they are non-coplanar. Given 
any four non-coplanar points, one may define a projective coordinate system in 
which these four points are the points $\mathbf{E}_i$ forming part of a projective basis. The 
Carlsson mapping may then be defined with respect to this coordinate frame. 
The resulting map is called the Carlsson map with respect to the given reference 
tetrahedron. 

To be more precise, it should be observed that five points (not four) define 
a projective coordinate frame in $\mathcal{P}^3$. In fact, there is a 3-parameter family of 
projective frames for which four non-coplanar points have coordinates $\mathbf{E}_i$. Thus 
the Carlsson map with respect to a given reference tetrahedron is not unique. 
However, the mapping given by Definition 1 with respect to any such coordinate 
frame may be used. 

Given a statement or theorem concerning projections of sets of points with 
respect to one or more projection centres one may derive a dual statement. One 
requires that among the points being projected, there are four non-coplanar 
points that may form a reference tetrahedron. Under a general duality mapping 
with respect to the reference tetrahedron: 

1. Points (other than those belonging to the reference tetrahedron) are mapped 
to centres of projection. 

2. Centres of projection are mapped to points. 

3. Straight lines are mapped to twisted cubics. 

4. Ruled quadrics containing the reference tetrahedron are mapped to ruled 
quadrics containing the reference tetrahedron. 

Points lying on an edge of the reference tetrahedron should be avoided, since the 
Carlsson mapping is undefined for such points. Using this as a sort of translation 
table, one may use existing theorems about point projection to be dualized, 
giving new theorems for which a separate proof is not needed. 

Note: It is important to observe that only those points not belonging to the 
reference tetrahedron are mapped to camera centres by duality. The vertices 
of the reference tetrahedron remain points. In practice, in applying the duality 
principle, one may select any 4 points to form the reference tetrahedron, as long 
as they are non-coplanar. In general, in the results stated in the next section there 
will be an assumption (not always stated explicitly) that point sets considered 
contain four non-coplanar points, which may be taken as the reference 
tetrahedron. 



2.1 Reconstruction Ambiguity 

It will be shown in this section how various ambiguous reconstruction results may 
be derived simply from known, or obvious geometrical statements by applying 
duality. 

We will be considering configurations of camera centres and 3D points, which 
will be denoted by $\{\mathbf{C}_1, \ldots, \mathbf{C}_m; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ or variations thereof. Implicit is 
that the symbols appearing before the semicolon are camera centres, and those 
that come after are 3D points. In order to make the statements of derived results 
simple, the concept of image equivalence is defined. 

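Definition 2. Two configurations 

$\{\mathbf{C}_1, \ldots, \mathbf{C}_m; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ and $\{\mathbf{C}'_1, \ldots, \mathbf{C}'_m; \mathbf{X}'_1, \ldots, \mathbf{X}'_n\}$ 

are called image equivalent if for all $i$ the image of the set of points $\mathbf{X}_1, \ldots, \mathbf{X}_n$ 
observed from camera centre $\mathbf{C}_i$ is projectively equivalent to the image of points 
$\mathbf{X}'_1, \ldots, \mathbf{X}'_n$ observed from $\mathbf{C}'_i$. 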

This definition makes sense only because an image is determined up to 
projective equivalence by the centre of projection. The image of the points 
$\mathbf{X}_1, \ldots, \mathbf{X}_n$ with respect to centre $\mathbf{C}_i$ may be thought of somewhat abstractly as 
the projective equivalence class of the set of rays $\{\mathbf{C}_i\mathbf{X}_j : j = 1, \ldots, n\}$. 

The concept of image equivalence is distinct from projective equivalence of 
the sets of points and camera centres involved. Indeed, the relevance of this 
to reconstruction ambiguity is that if a configuration $\{\mathbf{C}_1, \ldots, \mathbf{C}_m; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ 
allows another image-equivalent set which is not projective-equivalent, then this 
amounts to an ambiguity of the projective reconstruction problem, since the 
projective structure of the points and cameras is not uniquely defined by the set 
of images. In this case, we say that the configuration $\{\mathbf{C}_1, \ldots, \mathbf{C}_m; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ 
allows an alternative reconstruction. 



Single View Ambiguity 

As a simple example of what can be deduced using Carlsson duality, consider 
the following simple question: when do two points project to the same point 
in an image? The answer is obviously, when the two points lie on the same ray 
(straight line) through the camera centre. From this simple observation, one may 
deduce the following result. 

(2.1.2). Consider a set of camera centres $\mathbf{C}_1, \ldots, \mathbf{C}_m$ and a point $\mathbf{X}_0$ all lying 
on a single straight line, and let $\mathbf{E}_i$, $i = 1, \ldots, 4$ be the vertices of a reference 
tetrahedron. Let $\mathbf{X}$ be another point. Then the two configurations 

$\{\mathbf{C}_1, \ldots, \mathbf{C}_m; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}\}$ and $\{\mathbf{C}_1, \ldots, \mathbf{C}_m; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}_0\}$ 

are image-equivalent configurations if and only if $\mathbf{X}$ lies on the same straight 
line. 



This is illustrated in Fig. 1. 

In passing to the dual statement, according to Theorem (2.0.1) the straight 
line becomes a twisted cubic through the four vertices of the reference 
tetrahedron. Thus the dual statement to (2.1.2) is: 



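(2.1.3). Consider a set of points $\mathbf{X}_i$ and a camera centre $\mathbf{C}_0$ all lying on a single 
twisted cubic also passing through four reference vertices $\mathbf{E}_k$. Let $\mathbf{C}$ be any other 
camera centre. Then the configurations 

$\{\mathbf{C}; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ and $\{\mathbf{C}_0; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ 

are image equivalent if and only if $\mathbf{C}$ lies on the same twisted cubic. 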

Since the points $\mathbf{E}_i$ may be any four non-coplanar points, and a twisted cubic 
cannot contain 4 coplanar points, one may state this last result in the following 
form: 

Proposition 1. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a set of points and $\mathbf{C}_0$ a camera centre all 
lying on a twisted cubic. Then for any other camera centre $\mathbf{C}$ the configurations 

$\{\mathbf{C}; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ and $\{\mathbf{C}_0; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ 

are image equivalent if and only if $\mathbf{C}$ lies on the same twisted cubic. 

This is illustrated in Fig. 1. It shows that camera pose cannot be uniquely 
determined whenever all points and a camera centre lie on a twisted cubic. 

Using similar methods one can show that this is one of only two possible am- 
biguous situations. The other case in which ambiguity occurs is when all points 
and the two camera centres lie in the union of a plane and a line. This arises as 
the dual of the case when the straight line through the camera centres meets one 
of the vertices of the reference tetrahedron. In this case, the dual of this line is 
also a straight line through the same reference vertex (see Theorem (2.0.1) ), and 
all points must lie on this line or the opposite face of the reference tetrahedron. 
These results were brought to the attention of the computer-vision community 
by Buchanan (Q). 




Fig. 1. Left : Any point on the line passing through C and X is projected to 
the same point from projection centre C. Right : The dual statement - from 
any centre of projection C lying on a twisted cubic passing through X and the 
vertices of the reference tetrahedron, the five points are projected in the same 
way (up to projective equivalence). Thus a camera is constrained to lie on a 
twisted cubic by its image of five known points. 



The Horopter 

Similar arguments can be used to derive the form of the horopter for two images. 
The horopter is the set of space points that map to the same point in two images. 
The following result is self-evident. 

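(2.1.4). Given points $\mathbf{X}$ and $\mathbf{X}'$, the set of camera centres $\mathbf{C}$ such that 
$\{\mathbf{C}; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}\}$ and $\{\mathbf{C}; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}'\}$ 
are image equivalent is the straight line passing through $\mathbf{X}$ and $\mathbf{X}'$. 

This is illustrated in Fig. 2. The dual of this statement is 

Proposition 2. Given projection centres $\mathbf{C}$ and $\mathbf{C}'$, non-collinear with the four 
points $\mathbf{E}_i$ of a reference tetrahedron, the set of points $\mathbf{X}$ such that $\{\mathbf{C}; \mathbf{E}_1, \ldots, \mathbf{E}_4, 
\mathbf{X}\}$ and $\{\mathbf{C}'; \mathbf{E}_1, \ldots, \mathbf{E}_4, \mathbf{X}\}$ are image-equivalent is a twisted cubic passing through 
$\mathbf{E}_1, \ldots, \mathbf{E}_4$ and the two projection centres $\mathbf{C}$ and $\mathbf{C}'$. 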

Fig. 2. Left: From any centre of projection $\mathbf{C}, \mathbf{C}', \ldots$ lying on the line passing 
through $\mathbf{X}$ and $\mathbf{X}'$, the points $\mathbf{X}$ and $\mathbf{X}'$ are projected to the same ray. That 
is, $\{\mathbf{C}; \mathbf{E}_i, \mathbf{X}\}$ is image-equivalent to $\{\mathbf{C}; \mathbf{E}_i, \mathbf{X}'\}$ for all $\mathbf{C}$ on the line. Right: 
The dual statement - all points on the twisted cubic passing through $\mathbf{C}$ and 
$\mathbf{C}'$ and the vertices of the reference tetrahedron are projected in the same way 
relative to the two projection centres. That is, $\{\mathbf{C}; \mathbf{E}_i, \mathbf{X}\}$ is image-equivalent to 
$\{\mathbf{C}'; \mathbf{E}_i, \mathbf{X}\}$ for all $\mathbf{X}$ on the twisted cubic. This curve is called the horopter for 
the two centres of projection. 



Note in both these examples how the use of duality has taken intuitively obvi- 
ous statements concerning projections of collinear points and derived a result 
somewhat less obvious about points lying on a twisted cubic. 
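
Proposition 2 can be explored numerically. The sketch below rests on two assumptions that are ours rather than the paper's: both cameras are taken in the reduced form (1), so that image equivalence of the five projected points amounts to the two images of $\mathbf{X}$ agreeing up to scale, and horopter points are generated as Carlsson images of points on the line joining $\bar{\mathbf{C}}$ and $\bar{\mathbf{C}}'$. All names and sample values are illustrative.

```python
from fractions import Fraction

def bar(X):
    """Carlsson map in inverse form; requires non-zero coordinates."""
    return tuple(1 / Fraction(v) for v in X)

def reduced_camera(C):
    """Matrix (1) for centre C."""
    a, b, c, d = bar(C)
    return [[a, 0, 0, -d], [0, b, 0, -d], [0, 0, c, -d]]

def project(P, X):
    """Matrix-vector product P X."""
    return [sum(p * x for p, x in zip(row, X)) for row in P]

def proportional(u, v):
    """True if two homogeneous vectors agree up to scale."""
    n = len(u)
    return all(u[i] * v[j] == u[j] * v[i] for i in range(n) for j in range(n))

def horopter_point(C1, C2, theta):
    """Carlsson image of the point bar(C1) + theta * bar(C2) on the dual line."""
    u, v = bar(C1), bar(C2)
    on_line = tuple(ui + theta * vi for ui, vi in zip(u, v))
    return bar(on_line)

C1, C2 = (2, 3, 5, 7), (1, 4, 2, 3)          # two arbitrary projection centres
P1, P2 = reduced_camera(C1), reduced_camera(C2)

# Points generated on the twisted cubic are imaged identically by both cameras.
for theta in (1, 2, -3, Fraction(1, 2)):
    X = horopter_point(C1, C2, theta)
    assert any(project(P2, X))
    assert proportional(project(P1, X), project(P2, X))

# A point chosen off the curve is imaged differently.
off_curve = (1, 1, 1, 1)
assert not proportional(project(P1, off_curve), project(P2, off_curve))
```

The parameter value 0 is avoided because it gives the centre $\mathbf{C}$ itself, whose image in the first camera is the zero vector.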



Two-View Ambiguity 

The basic (0) result about critical surfaces from two views may be stated as 
follows. 

Theorem (2.1.5). A configuration $\{\mathbf{C}_1, \mathbf{C}_2; \mathbf{X}_1, \ldots, \mathbf{X}_n\}$ of two camera centres 
and $n$ points allows an alternative reconstruction if and only if both camera 
centres $\mathbf{C}_1, \mathbf{C}_2$ and all the points $\mathbf{X}_j$ lie on a ruled quadric surface. Furthermore, 
when an alternative reconstruction exists, then there will always exist a third 
distinct reconstruction. 

One may write down the dual statement straight away as follows. 

Theorem (2.1.6). A configuration $\{\mathbf{C}_1, \ldots, \mathbf{C}_n; \mathbf{X}_1, \ldots, \mathbf{X}_6\}$ of any number of 
cameras and six points allows an alternative reconstruction if and only if all 
camera centres $\mathbf{C}_1, \ldots, \mathbf{C}_n$ and all the points $\mathbf{X}_1, \ldots, \mathbf{X}_6$ lie on a ruled quadric surface. 
Furthermore, when an alternative reconstruction exists, then there will always 
exist a third distinct reconstruction. 

This result was proven in ng. Observe that in this dual statement, the value 
of $n$ is not the same as the value of $n$ in Theorem (2.1.5). Indeed, in the transition 
to the dual result, four of the original $n$ points $\mathbf{X}_j$ are selected as the reference 
tetrahedron, and remain points. The remaining $n - 4$ points become camera 
centres. The two original camera centres become points, making six points in 
total. The ruled quadric becomes a ruled quadric according to Theorem (2.0.1). 
The minimum interesting case of Theorem (2.1.6) is when $n = 3$, as studied in 
m-. In this case one has nine points in total (three cameras and six points). One 
can construct a quadric surface passing through these nine points (a quadric 
is defined by nine points). If the quadric is a ruled quadric (a hyperboloid of 
one sheet in the non-degenerate case), then there are three possible distinct 
reconstructions. Otherwise the reconstruction is unique. 

3 Dual Algorithms 

The method of duality will now be given for deriving a dual algorithm from a 
given algorithm. Specifically, it will be shown that if one has an algorithm for 
doing projective reconstruction from $n$ views of $m + 4$ points, then there is an 
algorithm for doing projective reconstruction from $m$ views of $n + 4$ points. This 
result, observed by Carlsson 0, will be made specific by explicitly describing 
the steps of the dual algorithm. 

We consider a projective reconstruction problem, which will be referred to 
as $\mathcal{P}(m, n)$. It is the problem of doing reconstruction from $m$ views of $n$ points. 
We denote image points by $\mathbf{x}^i_j$, which represents the image of the $j$-th object 
space point in the $i$-th view. Thus, the upper index indicates the view number, 
and the lower index represents the point number. Such a set of points $\{\mathbf{x}^i_j\}$ is 
called realizable if there are a set of camera matrices $P^i$ and a set of 3D points 
$\mathbf{X}_j$ such that $\mathbf{x}^i_j = P^i\mathbf{X}_j$. The projective reconstruction problem $\mathcal{P}(m, n)$ is that 
of finding such camera matrices $P^i$ and points $\mathbf{X}_j$ given a realizable set $\{\mathbf{x}^i_j\}$ for 
$m$ views of $n$ points. 

Let $\mathcal{A}(n, m + 4)$ represent an algorithm for solving the projective 
reconstruction problem $\mathcal{P}(n, m + 4)$. An algorithm will now be exhibited for solving the 
projective reconstruction problem $\mathcal{P}(m, n + 4)$. This algorithm will be denoted $\mathcal{A}^*(m, n + 4)$, 
the dual of the algorithm $\mathcal{A}(n, m + 4)$. 

Initially, the steps of the algorithm will be given without proof. In addition, 
difficulties will be glossed over so as to give the general idea without getting 
bogged down in details. In the description of this algorithm it is important to 
keep track of the range of the indices, and whether they index the cameras or 
the points. Thus, the following may help to keep track. 

— Upper indices represent the view number. 

— Lower indices represent the point number. 

— i ranges from 1 to m. 

— j ranges from 1 to n. 

— k ranges from 1 to 4. 

The Dual Algorithm 

Given an algorithm $\mathcal{A}(n, m + 4)$ the goal is to exhibit a dual algorithm 
$\mathcal{A}^*(m, n + 4)$. 

Input: 

The input to the algorithm $\mathcal{A}^*(m, n + 4)$ consists of a realizable set of $n + 4$ points 
seen in $m$ views. This set of points can be arranged in a table as in Fig. 3 (left). 



[Figure 3: two arrays with one column per view ($i = 1, \ldots, m$) and one row per 
point. Left: entries $\mathbf{x}^i_j$, with the four distinguished points $\mathbf{x}^i_{n+k}$ in the last 
four rows. Right: entries $\mathbf{x}'^i_j$, with $\mathbf{e}_1, \ldots, \mathbf{e}_4$ in the last four rows; a 
transformation $T^i$ is applied to each column.] 

Fig. 3. Left: Input to algorithm $\mathcal{A}^*(m, n + 4)$. Right: Input data after 
transformation. 



In this table, the points $\mathbf{x}^i_{n+k}$ are separated from the other points $\mathbf{x}^i_j$, since 
they will receive special treatment. 



Step 1: Transform. 

The first step is to compute for each $i$, a transformation $T^i$ that maps the points 
$\mathbf{x}^i_{n+k}$, $k = 1, \ldots, 4$ in the $i$-th view to the points $\mathbf{e}_k$ of a canonical basis for 
projective 2-space $\mathcal{P}^2$. The transformation $T^i$ is applied also to each of the points 
$\mathbf{x}^i_j$ to produce transformed points $\mathbf{x}'^i_j = T^i\mathbf{x}^i_j$. The result is the transformed 
point array shown in Fig. 3 (right). A different transformation $T^i$ is computed 
and applied to each column of the array, as indicated. 
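
One standard way to obtain such a transformation for a single view is sketched below. This is an assumption about how Step 1 might be carried out for noise-free data, not the paper's prescription, and the function names are illustrative. The first three points are placed in the columns of a matrix, the fourth is expressed as a combination of them, and the inverse is rescaled row by row.

```python
from fractions import Fraction

def inv3(M):
    """Inverse of a 3x3 matrix via the adjugate, in exact arithmetic."""
    (a, b, c), (d, e, f), (g, h, i) = M
    D = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if D == 0:
        raise ValueError("the first three points are collinear")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[Fraction(v) / D for v in row] for row in adj]

def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def proportional(u, v):
    """True if two homogeneous vectors agree up to scale."""
    n = len(u)
    return all(u[i] * v[j] == u[j] * v[i] for i in range(n) for j in range(n))

def basis_transform(x1, x2, x3, x4):
    """3x3 matrix T with T x_k proportional to e_k for k = 1, ..., 4."""
    M = [[x1[r], x2[r], x3[r]] for r in range(3)]   # columns are x1, x2, x3
    Minv = inv3(M)
    lam = matvec(Minv, x4)                           # x4 = sum_k lam_k x_k
    if 0 in lam:
        raise ValueError("three of the four points are collinear")
    return [[Minv[r][c] / lam[r] for c in range(3)] for r in range(3)]

# Four image points in one view, no three of them collinear (arbitrary values).
pts = [(1, 2, 1), (3, 1, 1), (0, 0, 1), (2, 5, 1)]
basis = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
T = basis_transform(*pts)
assert all(proportional(matvec(T, x), e) for x, e in zip(pts, basis))
```

The same `T` would then be applied with `matvec` to every other point of that view. With noisy measurements the transformation distorts the noise distribution, so this exact construction is only a starting point.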



Step 2: Transpose. 

The last four rows of the array are dropped, and the remaining block of the 
array is transposed. One defines $\hat{\mathbf{x}}^j_i = \mathbf{x}'^i_j$. At the same time, one does a mental 
switch of points and views. Thus the point $\hat{\mathbf{x}}^j_i$ is now conceived as being the 
image of the $i$-th point in the $j$-th view, whereas the point $\mathbf{x}'^i_j$ was the image of 
the $j$-th point in the $i$-th view. What is happening here effectively is that one is 
swapping the roles of points and cameras, the basic concept behind Carlsson 
duality expressed by (3). The resulting transposed array is shown in Fig. 4 (left). 



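[Figure 4: two arrays with one column per view ($j = 1, \ldots, n$) and one row per 
point. Left: entries $\hat{\mathbf{x}}^j_i$ for $i = 1, \ldots, m$. Right: the same array with four 
extra rows containing $\mathbf{e}_1, \ldots, \mathbf{e}_4$.] 

Fig. 4. Left: Transposed data. Right: Transposed data extended by addition 
of extra points. 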



Step 3: Extend. 

The array of points is now extended by the addition of four extra rows 
containing points $\mathbf{e}_k$ in all positions of the $(m + k)$-th row of the array, as shown in 
Fig. 4 (right). 

Step 4: Solve. 

The array of points resulting from the last step has $m + 4$ rows and $n$ columns, 
and may be regarded as the positions of $m + 4$ points seen in $n$ views. As such, it 
is a candidate for solution by the algorithm $\mathcal{A}(n, m + 4)$, which we have assumed 
is given. Essential here is that the points in the array form a realizable set of 
point correspondences. Justification of this is deferred for now. The result of the 
algorithm $\mathcal{A}(n, m + 4)$ is a set of cameras $\hat{P}^j$ and points $\hat{\mathbf{X}}_i$ such that $\hat{\mathbf{x}}^j_i = \hat{P}^j\hat{\mathbf{X}}_i$. 
In addition, corresponding to the last four rows of the array, there are points 
$\hat{\mathbf{X}}_{m+k}$ such that $\mathbf{e}_k = \hat{P}^j\hat{\mathbf{X}}_{m+k}$ for all $j$. 

Step 5: 3D Transform. 

Since the reconstruction obtained in the last step is a projective reconstruction, 
one may transform it (equivalently, choose a projective coordinate frame) such 
that the points $\hat{\mathbf{X}}_{m+k}$ are the four points $\mathbf{E}_k$ of a partial canonical basis for 
$\mathcal{P}^3$. The only requirement is that the points $\hat{\mathbf{X}}_{m+k}$ obtained in the projective 
reconstruction not be coplanar. This assumption is validated later. 

At this point, one sees that $\mathbf{e}_k = \hat{P}^j\hat{\mathbf{X}}_{m+k} = \hat{P}^j\mathbf{E}_k$. From this it follows that 
$\hat{P}^j$ has the special form 

$$\hat{P}^j = \begin{bmatrix} a^j & 0 & 0 & d^j \\ 0 & b^j & 0 & d^j \\ 0 & 0 & c^j & d^j \end{bmatrix} \qquad (4)$$

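Step 6: Dualize. 

Let $\hat{\mathbf{X}}_i = (X_i, Y_i, Z_i, T_i)^\top$, and $\hat{P}^j$ be as given in (4). Now define points $\mathbf{X}_j = 
(a^j, b^j, c^j, d^j)^\top$ and cameras 

$$P'^i = \begin{bmatrix} X_i & 0 & 0 & T_i \\ 0 & Y_i & 0 & T_i \\ 0 & 0 & Z_i & T_i \end{bmatrix} .$$

Then one verifies that 

$$P'^i\mathbf{X}_j = (X_i a^j + T_i d^j,\; Y_i b^j + T_i d^j,\; Z_i c^j + T_i d^j)^\top = \hat{P}^j\hat{\mathbf{X}}_i = \hat{\mathbf{x}}^j_i = \mathbf{x}'^i_j .$$

If in addition, one defines $\mathbf{X}_{n+k} = \mathbf{E}_k$ for $k = 1, \ldots, 4$, then $P'^i\mathbf{X}_{n+k} = \mathbf{e}_k$. It 
is then evident that the cameras $P'^i$ and points $\mathbf{X}_j$ and $\mathbf{X}_{n+k}$ form a projective 
reconstruction of the transformed data array obtained in Step 1 of this algorithm. 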
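
The exchange performed in Step 6 amounts to reading the four parameters of each dual camera as a point, and the four coordinates of each dual point as the parameters of a camera. The sketch below illustrates this on made-up dual data; the function names and the sample values are assumptions for illustration only.

```python
def reduced_matrix(p):
    """Camera matrix of the form (4) built from four parameters (a, b, c, d)."""
    a, b, c, d = p
    return [[a, 0, 0, d], [0, b, 0, d], [0, 0, c, d]]

def project(P, X):
    """Matrix-vector product P X."""
    return tuple(sum(p * x for p, x in zip(row, X)) for row in P)

def dualize(dual_cameras, dual_points):
    """Step 6: swap the roles of the dual cameras and the dual points.

    dual_cameras[j] holds the parameters (a, b, c, d) of the j-th dual camera.
    dual_points[i] holds the coordinates (X, Y, Z, T) of the i-th dual point.
    Returns (cameras, points): cameras[i] is the i-th camera in the original
    domain and points[j] is the j-th point in the original domain.
    """
    cameras = [reduced_matrix(Xh) for Xh in dual_points]
    points = [tuple(p) for p in dual_cameras]
    return cameras, points

# A made-up dual reconstruction with n = 2 dual cameras and m = 3 dual points.
dual_cameras = [(2, 3, 5, 1), (7, 11, 13, 2)]
dual_points = [(1, 2, 3, 4), (2, -1, 1, 3), (5, 4, -2, 1)]

cameras, points = dualize(dual_cameras, dual_points)
for i, Xh in enumerate(dual_points):
    for j, p in enumerate(dual_cameras):
        # Original-domain projection equals the dual-domain projection.
        assert project(cameras[i], points[j]) == project(reduced_matrix(p), Xh)
```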



Step 7: Reverse Transform. 

Finally, defining $P^i = (T^i)^{-1}P'^i$, and with the points $\mathbf{X}_j$ and $\mathbf{X}_{n+k}$ obtained 
in the previous step, one has a projective reconstruction of the original data. 
Indeed, one verifies 

$$P^i\mathbf{X}_j = (T^i)^{-1}P'^i\mathbf{X}_j = (T^i)^{-1}\mathbf{x}'^i_j = \mathbf{x}^i_j .$$

This completes the description of the algorithm. One can see that it takes 
place in various stages. 

1. In Step 1, the data is transformed into canonical image reference frames 
based on the selection of 4 distinguished points. 

2. In Steps 2 and 3 the problem is mapped into the dual domain, resulting in 
a dual problem $\mathcal{P}(n, m + 4)$. 

3. The dual problem is solved in step 4 and 5. 

4. Step 6 maps the solution back into the original domain. 

5. Step 7 undoes the effects of the initial transformation. 

3.1 Justification of the Algorithm. 

To justify this algorithm, one needs to be sure that at Step 4 there indeed exists 
a solution to the transformed problem. Before considering this, it is necessary to 
explain the purpose of Step 3, which extends the data by the addition of rows of 
image points $\mathbf{e}_k$, and Step 5, which transforms the arbitrary projective solution 
to one in which four points are equal to the 3D basis points $\mathbf{E}_k$. 

The purpose of these steps is to ensure that one obtains a solution to the 
dual reconstruction problem in which $\hat{P}^j$ has the special form given by (4), in 
which the camera matrix is parametrized by only 4 values. The dual algorithm 
is described in this manner so that it will work with any algorithm A(n, m + 4) 
whatever. However, both Steps 3 and 5 may be eliminated if the known algorithm 
A(n, m+4) has the capability of enforcing this constraint on the camera matrices 
directly. Algorithms based on the fundamental matrix, trifocal or quadrifocal 
tensors may easily be modified in this way, as will be seen. 

In the meantime, a camera matrix $\hat{P}^j$ of the form (4) is called a reduced camera 
matrix, and we call any reconstruction in which each camera matrix is of this form a 
reduced reconstruction. Not all sets of realizable point correspondences allow a 
reduced reconstruction; however, the following result characterizes sets of point 
correspondences that do have this property. 

(3.1.7). A set of image points $\{\mathbf{x}^i_j : i = 1, \ldots, m;\ j = 1, \ldots, n\}$ permits a 
reduced reconstruction if and only if it may be augmented with supplementary 
correspondences $\mathbf{x}^i_{n+k} = \mathbf{e}_k$ for $k = 1, \ldots, 4$ such that 

1. The total set of image correspondences is realizable, and 

2. The reconstructed points $\mathbf{X}_{n+k}$ corresponding to the supplementary image 
correspondences are non-coplanar. 

Proof. The proof is straightforward enough. Suppose the set permits a reduced 
reconstruction, and let $P^i$ be the set of reduced camera matrices. Let points 
$\mathbf{X}_{n+k} = \mathbf{E}_k$ for $k = 1, \ldots, 4$ be projected into the $m$ images. The projections are 
$\mathbf{x}^i_{n+k} = P^i\mathbf{X}_{n+k} = P^i\mathbf{E}_k = \mathbf{e}_k$ for all $i$. 

Conversely, suppose the augmented set of points is realizable and the points 
$\mathbf{X}_{n+k}$ are non-coplanar. In this case, a projective basis may be chosen such that 
$\mathbf{X}_{n+k} = \mathbf{E}_k$. Then for each view, one has $\mathbf{e}_k = P^i\mathbf{E}_k$ for all $k$. From this it follows 
that each $P^i$ has the desired form (4). □ 

One other remark must be made before proving the correctness of the algo- 
rithm. 

(3.1.8). If a set of image points $\{\mathbf{x}^i_j : i = 1, \ldots, m;\ j = 1, \ldots, n\}$ permits a 
reduced reconstruction then so does the transposed set $\{\hat{\mathbf{x}}^j_i : j = 1, \ldots, n;\ i = 
1, \ldots, m\}$ where $\hat{\mathbf{x}}^j_i = \mathbf{x}^i_j$ for all $i$ and $j$. 

This is the basic duality property, effectively proven by the construction given 
in Step 6 of the algorithm above. Now it is possible to prove the correctness of 
the algorithm. 

Proposition 3. Let $\mathbf{x}^i_j$ and $\mathbf{x}^i_{n+k}$ as in Fig. 3 (left) be a set of realizable image 
point correspondences, and suppose 

1. for each $i$, the four points $\mathbf{x}^i_{n+k}$ are non-collinear, and 

2. the four points $\mathbf{X}_{n+k}$ in a projective reconstruction are non-coplanar. 

Then the algorithm of section 3 will succeed. 



Proof. Because of the first condition of the proposition, transformations $T^i$ exist 
for each $i$, transforming the input data to the form shown in Fig. 3 (right). This 
transformed data is also realizable, since the transformed data differ only by a 
projective transformation of the image. 

Now, according to (3.1.7) applied to Fig. 3 (right), the correspondences 
$\mathbf{x}'^i_j$ admit a reduced realization. By (3.1.8) the transposed data Fig. 4 (left) 
also admits a reduced realization. Applying (3.1.7) once more shows that 
the extended data Fig. 4 (right) is realizable. Furthermore, the points $\hat{\mathbf{X}}_{m+k}$ are 
non-coplanar, and so Step 5 is valid. The subsequent steps 6 and 7 go forward 
without problems. □ 



The first condition may be checked from the image correspondences $\mathbf{x}^i_j$. It 
may be thought that to check the second condition requires reconstruction to be 
carried out. It is, however, possible to check whether the reconstructed points will 
be coplanar without carrying out the reconstruction. This is left as an exercise 
for the reader. 



4 Refinements to the Dual Algorithm 

The dual algorithm as presented above gives a way of dualizing any given pro- 
jective reconstruction algorithm. The main weakness of this approach is that 
it ignores possible noise in the measurements. Noise ought to be considered at 
several points. 



Direct Enforcement of Reduced Reconstruction. 

Steps 3 and 5 of the algorithm are used to make sure that the camera matrices 
in the computed reconstruction are of the form (4). The trouble with this is 
that the points $\hat{\mathbf{x}}^j_{m+k} = \mathbf{e}_k$ are treated as any other point in the reconstruction. 
In the presence of noise, most algorithms, such as those based on multifocal 
tensors, find reconstructions for which the input point correspondences are only 
approximately satisfied, to the extent that is possible given the level of noise. 
However, in order that the camera matrices should be of the correct form, it is 
necessary that the correspondences $\hat{\mathbf{x}}^j_{m+k} = \mathbf{e}_k$ be satisfied exactly. Thus, these 
correspondences must be treated differently from the others. 

Preferable would be to enforce the constraint that the camera matrices are 
of the reduced form directly. In the case where n = 2, the algorithm A(n, m + 4) used 
to obtain the reconstruction in the dual domain may be the 8-point algorithm. 
Apart from assuming that each of the camera matrices is reduced, one may 
assume further that the first one has the special canonical form $P^1 = [I \mid 0]$. In 
this case, with $P^2$ of the reduced form, one computes that the fundamental matrix has 
the form (up to a scale factor) 

\[ F = \begin{pmatrix} 0 & -b & c \\ a & 0 & -c \\ -a & b & 0 \end{pmatrix} \tag{5} \]

The 8-point algorithm may easily be modified so that the computed fundamental 
matrix has this form. The retrieval of the reduced camera matrix $P^2$ from F is 
then trivial. 
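
As an illustration of how the 8-point algorithm might be modified, the following minimal sketch solves directly for the three entries a, b, c of Eq. (5). It assumes (this is not stated explicitly above) that the reduced second camera has the form $[\mathrm{diag}(a, b, c) \mid (1, 1, 1)^T]$, which is consistent with Eq. (5); the function names and the synthetic data are hypothetical.

```python
import numpy as np

def reduced_fundamental(a, b, c):
    """Fundamental matrix of the form in Eq. (5)."""
    return np.array([[0.0, -b, c],
                     [a, 0.0, -c],
                     [-a, b, 0.0]])

def fit_reduced_fundamental(x1, x2):
    """Estimate (a, b, c) up to scale from homogeneous points of shape (N, 3).

    Each correspondence x2^T F x1 = 0 expands to the linear equation
    a*x1[0]*(x2[1]-x2[2]) + b*x1[1]*(x2[2]-x2[0]) + c*x1[2]*(x2[0]-x2[1]) = 0.
    """
    A = np.column_stack([
        x1[:, 0] * (x2[:, 1] - x2[:, 2]),
        x1[:, 1] * (x2[:, 2] - x2[:, 0]),
        x1[:, 2] * (x2[:, 0] - x2[:, 1]),
    ])
    _, _, vt = np.linalg.svd(A)
    return vt[-1]  # unit-norm right singular vector of the smallest singular value

# Synthetic, noise-free check with assumed camera parameters.
rng = np.random.default_rng(0)
abc_true = np.array([0.7, -1.3, 2.1])
X = rng.normal(size=(10, 4))                          # homogeneous 3D points
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])         # [I | 0]
P2 = np.hstack([np.diag(abc_true), np.ones((3, 1))])  # assumed reduced form
x1, x2 = X @ P1.T, X @ P2.T
abc_est = fit_reduced_fundamental(x1, x2)
F_est = reduced_fundamental(*abc_est)
```

Because F then has only three homogeneous unknowns, two correspondences already determine it in the noise-free case; with more points the SVD gives a least-squares estimate.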

In the case where n = 3, one may use an algorithm based on the trifocal 
tensor. For three general camera matrices $[I \mid 0]$, $A = [a^j_i]$ and $B = [b^j_i]$, the 
general formula for the trifocal tensor is known to be 

\[ T_i^{jk} = a_i^j b_4^k - a_4^j b_i^k \tag{6} \]

for $1 \le i, j, k \le 3$. Translated into the notation of the present paper and applied 
to reduced camera matrices $P^1 = [I \mid 0]$, $P^2$ and $P^3$ (and assuming 
that $d^2 = d^3 = 1$), one sees that there are only 15 non-zero entries of $T_i^{jk}$, and 
these entries are linear in terms of the values $a^j$, $b^j$ and $c^j$ for j = 2, 3. Thus, 
one may solve linearly for the $T_i^{jk}$ corresponding to reduced camera matrices, 
and in fact find the entries of the reduced camera matrices linearly. 



The Transformations $T^i$ 

The most serious difficulty in finding a well-performing algorithm that uses this 
dualization scheme to reduce to a known algorithm is how to handle the 
transformations $T^i$. Application of projective transformations to the image data has 
the effect of distorting any noise distribution that may apply to the data. The 
problem also exists of choosing four points that are non-collinear in any of the 
images. If the points are close to collinear in any of the images, then the 
projective transformation applied to the image in Step 1 of the algorithm may entail 
extreme distortion of the image. In an earlier algorithm for computing 
the quadrifocal tensor, this sort of distortion was shown to degrade performance 
of the algorithm severely. 

5 Experimental Performance 

Algorithms based on the fundamental matrix (the 8-point algorithm) for two 
views and the trifocal tensor (three views) were dualized, resulting in algorithms 
for 6 or 7 points in any number of views. The results of these tests were reported 
as a student report in August 1996 by Gilles Debunne. Since this report is 
effectively unavailable, the results are summarized here. 

Performance of the algorithms was generally unsatisfactory, mainly due to 
the distortion of the noise by the application of the transformations $T^i$. It was 
observed that errors due to noise may be minimized in Step 4 of the algorithm. 
Reversing the dualization in Step 6 of the algorithm results in the same small 
errors. However, when the inverse projective transformations are applied in Step 
7, the average error became very large. Some points retained quite small er- 
ror, whereas in those images where distortion was significant, quite large errors 
resulted. 

Normalization of the data is also a problem. It has been shown to 
be essential for performance of the linear reconstruction algorithms to apply 
data normalization. However, what sort of normalization should be applied to 
the transformed data, which is geometrically unrelated to actual 
image measurements, is a mystery. 

To get good results, it would seem that one would need to propagate assumed 
error distributions forward in Step 1 of the algorithm to get assumed error 
distributions for the transformed data, and then during reconstruction to 
minimize residual error relative to this propagated error distribution. However, 
the fundamental matrix and trifocal tensor algorithms do not provide ways of 
dealing with arbitrary error distributions. 
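
To make the idea of propagating an error distribution through Step 1 concrete, here is a minimal sketch of first-order covariance propagation through a projective transformation of the image. It is an illustration under stated assumptions (a Gaussian error model and a hypothetical 3 x 3 matrix `H` standing in for one of the transformations), not part of the tested algorithms.

```python
import numpy as np

def apply_homography(H, p):
    """Map an inhomogeneous 2D point p through the 3x3 projective transformation H."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def homography_jacobian(H, p):
    """2x2 Jacobian of apply_homography with respect to p."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    u, v = x / w, y / w
    return np.array([[H[0, 0] - u * H[2, 0], H[0, 1] - u * H[2, 1]],
                     [H[1, 0] - v * H[2, 0], H[1, 1] - v * H[2, 1]]]) / w

def propagate_covariance(H, p, cov):
    """First-order estimate of the covariance of the transformed point."""
    J = homography_jacobian(H, p)
    return J @ cov @ J.T

# Hypothetical transformation and an isotropic unit covariance in the original image.
H = np.array([[1.0, 0.1, 5.0],
              [0.0, 1.2, -3.0],
              [0.002, 0.001, 1.0]])
p = np.array([100.0, 50.0])
cov_out = propagate_covariance(H, p, np.eye(2))
```

An isotropic input covariance generally becomes anisotropic and position-dependent after the transformation, which is the distortion of the noise described above.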

6 Conclusion 

Duality as introduced by Carlsson is a very interesting theoretical tool for under- 
standing camera projection. It seems also to have potential to provide algorithms 
for reconstruction from image sequences containing a large number of images. 
To this point, however, problems with dealing with noise distributions are an 
impediment to good performance. 

There seems to be good hope, however, for eventually using methods like this 
for finding linear algorithms for carrying out reconstruction from extended image 
sequences. Finding such a method would represent a significant advance, since 
at present linear methods for reconstruction have been limited to reconstruction 
from small numbers of views. 

Abstract. We introduce a unified framework for developing matching 
constraints of multiple affine views and rederive 2-view (affine epipolar 
geometry) and 3-view (affine image transfer) constraints within this 
framework. With the insight into the particular structure of these 
multiple-view constraints, we first describe a new linear method for Euclidean 
motion and structure from 3 calibrated affine images. Compared with 
the existing linear method of Huang and Lee [6], the new method uses 
different and more appropriate constraints. It does not have the failure mode of 
the Euclidean factorisation method of Tomasi and Kanade. We then 
describe how to integrate points and lines and establish some minimal 
point/line configurations for structure recovery. The method is 
demonstrated on real image sequences. 



1 Introduction 

Motion/structure from orthographic or weak perspective views is a very old 
and popular topic. It is well known that at least 4 non-planar points over 3 
orthographic or weak perspective views are sufficient to uniquely determine 
motion/structure up to a reflection about the image plane [22,6,7]. Many algorithms 
have been published for this problem: the linear methods of Huang and Lee [6], 
the non-linear algebraic methods of Koenderink and van Doorn [7] and the non-linear 
numerical method of Shapiro et al. A good review of the different methods 
can be found in PSI. 

In this paper, we introduce a unified framework for developing matching 
constraints of multiple affine views. In particular, 2-view and 3-view constraints 
will be derived, and all existing methods can be recast into this framework. Our 
key observation is that classical linear methods for metric motion/structure from 
3 orthographic or weak perspective views were heavily based on affine epipolar 
geometry and did not use the full set of 3-image constraints, thus leading to 
over-parameterisation and inconsistent motion recovery. Based on insight into 
the particular structure of these constraints, we propose a linear algorithm 
that uses 9 linear parameters to encode the 8 Euclidean motion parameters of 3 
weak perspective views. After that, we investigate how points and lines can be 
integrated into the same framework thanks to the common matching constraints. 

This work was also partly motivated by the application of novel image 
synthesis from example images, as Euclidean reconstruction from a minimal number 
of images is required here. Part of this work has also been presented elsewhere. 

The paper is organised as follows. In Section 2 we review the affine 
camera model. Then, we introduce a unified framework for studying the geometric 
constraints among multiple affine images in Section 3. The linear method for 
Euclidean motion/structure from 3 calibrated affine images is developed in 
Section 4. The integration of points and lines is described in Section 5. 
Experimental results are presented in Section 6 and a short conclusion is given in 
Section 7. 

Throughout the paper, matrices are denoted in upper case boldface, vectors 
in lower case boldface and scalars are either in lower case or lower case Greek. 



2 Review of the Affine Camera Model 



For a restricted class of camera models, by setting the third row of the perspective 
camera $P_{3\times4}$ to $(0, 0, 0, \lambda)$, we obtain the affine camera initially introduced by 
Mundy and Zisserman in [10]: 

\[ A_{3\times4} = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ 0 & 0 & 0 & p_{34} \end{pmatrix} = \begin{pmatrix} M_{2\times3} & (t_1, t_2)^T \\ 0_{1\times3} & t_3 \end{pmatrix}. \tag{1} \]

The affine camera $A_{3\times4}$ subsumes the orthographic, weak perspective and 
para-perspective models. For more detailed relations and applications, one can refer to 
the cited literature. 

Finite points in affine spaces $\mathbb{R}^n$ are naturally embedded into $\mathbb{P}^n$ by the 
mappings $u_a \mapsto u = (u_a^T, 1)^T$ and $x_a \mapsto x = (x_a^T, 1)^T$. We have therefore 
$u_a = M_{2\times3} x_a + t_0$, where $t_0 = (t_1/t_3, t_2/t_3)^T = (p_{14}/p_{34}, p_{24}/p_{34})^T$. If we further 
use relative coordinates of the points with respect to a given reference point (for 
instance, the centroid of a set of points), the vector $t_0$ is cancelled, and we obtain 
the following linear mapping between relative space points and relative image 
points: 

\[ \Delta u = M_{2\times3} \Delta x. \tag{2} \]



Equation (2) is the basic projection equation for points in an affine camera 
when relative coordinates are used. The reference point to determine the rela- 
tive coordinates determines uniquely the translational component of the affine 
projection matrix. Throughout this paper, the reference point is always taken to 
be the centroid in each image. 
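
The cancellation of the translation by relative coordinates can be checked numerically. The sketch below uses a randomly generated affine camera (an assumption for illustration only) and normalises $M_{2\times3}$ by $p_{34}$ so that it acts directly on inhomogeneous coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical affine camera: third row is (0, 0, 0, p34).
A = np.vstack([rng.normal(size=(2, 4)), [0.0, 0.0, 0.0, 2.0]])
X = rng.normal(size=(6, 3))                    # finite 3D points
Xh = np.hstack([X, np.ones((6, 1))])           # embedded into projective space

uh = Xh @ A.T
u = uh[:, :2] / uh[:, 2:3]                     # inhomogeneous image points

M = A[:2, :3] / A[2, 3]                        # 2x3 linear part, scaled by p34
t0 = A[:2, 3] / A[2, 3]                        # translational part

# Relative coordinates with respect to the centroids.
du = u - u.mean(axis=0)
dX = X - X.mean(axis=0)
```

Here `u` equals `X @ M.T + t0`, while `du` equals `dX @ M.T`, so the translation no longer appears once the centroid is used as the reference point.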

3 Unifying 2-View and 3-View Geometry of Points 

For projective cameras, the geometric constraints among multiple projective 
views have been thoroughly studied. There has been no similar 
effort for the affine camera case, although the geometric constraints among affine 
views are well known. 

For notational simplicity, rewrite Equation (2) as $u = M_{2\times3} x$. 

We can now examine the matching constraints between multiple views of the 
same point. Let the three views of the same point x be given as follows: 

\[ u = Mx, \qquad u' = M'x, \qquad u'' = M''x. \tag{3} \]

These can be rewritten together in matrix form as 

\[ \begin{pmatrix} M & u \\ M' & u' \\ M'' & u'' \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = 0, \tag{4} \]

where $\lambda \neq 0$ encodes the (unrecoverable) global scale factor of the 
reconstruction. 

As the vector $(x^T, \lambda)^T$ cannot be zero, the rank of the 6 x 4 coefficient matrix is at 
most 3, so all of its 4 x 4 minors vanish. There are $\binom{6}{4} = 15 = 3 + 4 + 4 + 4$ such 
minors, which can be divided into two types: 

- 2-view constraints involving only two views with two rows from each view, 

- 3-view constraints involving all three views with two rows from one view and 
one from each of the others. 

There are three 2-view and three sets of four 3-view constraints. Among the 3 
sets of 3-view constraints, only one of them is independent due to the symmetry. 

Each expansion of these 4 x 4 minors is linear in the image coordinates u, 
u' and u'' with the coefficients $t_i$ coming from the 3 x 3 minors of the following 
6 x 3 joint projection matrix: 

\[ \begin{pmatrix} M \\ M' \\ M'' \end{pmatrix} = (1\ 2\ 1'\ 2'\ 1''\ 2'')^T. \]

There are in total $\binom{6}{3} = 20 = 8 + 4 + 8$ such minors, since, as we will see later, 
4 of the 20 minors are common to the 2-view and 3-view constraints. All these 
minors provide a linear coordinate system to span the joint projection matrix. 
The constraints for more than 3 views will be briefly discussed in Section 0 
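
The counting argument above can be reproduced numerically. The following sketch builds the 6 x 4 coefficient matrix for one point seen by three randomly generated affine cameras (hypothetical data), enumerates its 4 x 4 minors, and classifies them by the number of views they involve.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
Ms = [rng.normal(size=(2, 3)) for _ in range(3)]   # M, M', M'' (relative coordinates)
x = rng.normal(size=3)                              # one relative space point
us = [M @ x for M in Ms]                            # its three relative image points

# 6x4 coefficient matrix [M u; M' u'; M'' u''].
C = np.vstack([np.hstack([M, u[:, None]]) for M, u in zip(Ms, us)])

view_of_row = [0, 0, 1, 1, 2, 2]
minors4 = {rows: np.linalg.det(C[list(rows)])
           for rows in itertools.combinations(range(6), 4)}
two_view = [r for r in minors4 if len({view_of_row[i] for i in r}) == 2]
three_view = [r for r in minors4 if len({view_of_row[i] for i in r}) == 3]

# 3x3 minors of the 6x3 joint projection matrix.
J = np.vstack(Ms)
minors3 = [np.linalg.det(J[list(r)]) for r in itertools.combinations(range(6), 3)]
```

The dictionary `minors4` has 15 entries, all numerically zero, of which 3 involve two views and 12 involve three views; `minors3` has 20 entries.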



3.1 Two- View Constraints 

There are three 2-view constraints corresponding to the 3 pairs of the 3 views, 
namely the vanishing of the determinants [1 2 1' 2'], [1 2 1'' 2''] and [1' 2' 1'' 2'']: 

\[ t_{13} u + t_{14} v + t_{10} u' + t_9 v' = 0, \]
\[ t_{15} u' + t_{16} v' + t_{17} u'' + t_{18} v'' = 0, \]
\[ t_{19} u + t_{20} v + t_{12} u'' + t_{11} v'' = 0. \]

These are the affine epipolar geometry. The set of 3 x 4 = 12 coefficients $t_i$ for 
i = 9, ..., 20 are 12 of the 20 minors of the joint projection matrix. 

Each point correspondence from two images gives one homogeneous linear 
equation. Taking into account the reference point used for relative coordinates, 4 
points are sufficient for uniquely determining the affine epipolar geometry. 
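
A minimal sketch of this estimation step follows, assuming noise-free synthetic data and writing the four coefficients of one 2-view constraint as (a, b, c, d); the function name is hypothetical.

```python
import numpy as np

def fit_affine_epipolar(u, u1):
    """Coefficients (a, b, c, d), up to scale, of a*du + b*dv + c*du' + d*dv' = 0.

    u, u1: (N, 2) image points in two views; the centroid is used as the
    reference point for the relative coordinates.
    """
    du = u - u.mean(axis=0)
    du1 = u1 - u1.mean(axis=0)
    A = np.hstack([du, du1])                 # one equation per point
    _, sing, vt = np.linalg.svd(A)
    return vt[-1], sing

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))                  # 4 generic (non-coplanar) points
M, M1 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
t, t1 = rng.normal(size=2), rng.normal(size=2)
u, u1 = X @ M.T + t, X @ M1.T + t1           # two affine views
abcd, sing = fit_affine_epipolar(u, u1)
```

With 4 points the centred system has rank 3, so the null vector is unique up to scale; with more points the same call returns a least-squares estimate.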

The affine epipolar geometry equation was introduced in [6] for orthography, 
and later in [8] for weak perspective. Shapiro et al. nicely related 
the affine epipolar geometry with Koenderink and van Doorn's rotation 
parameterisation [7]. But Koenderink's method is equivalent to Lee and Huang's [8]. 


3.2 Three-View Constraints 

There are four 3-view constraints from the vanishing of the determinants [1 2 1' 1''], 
[1 2 2' 1''], [1 2 1' 2''] and [1 2 2' 2'']. By careful inspection of the minors (for example, 
using the computer algebra tool Maple), we have: 

\[ t_4 u + t_8 v + t_{11} u' + t_9 u'' = 0, \]
\[ t_2 u + t_6 v + t_{11} v' + t_{10} u'' = 0, \]
\[ t_3 u + t_7 v + t_{12} u' + t_9 v'' = 0, \]
\[ t_1 u + t_5 v + t_{12} v' + t_{10} v'' = 0. \]

Among 12 minors, 8 of them are new ($t_1$ to $t_8$) and 4 of them are common with 
2-view constraints ($t_9$ to $t_{12}$). 

These are the transfer equations over three views. Any orthographic view 
of a point set can be expressed as a linear combination of two other views if this 
point set undergoes a linear transformation in space. This has been extensively 
used in object recognition. 
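
The linear-combination property can be illustrated with a short sketch. It assumes noise-free relative coordinates and three randomly generated affine views (hypothetical data), and expresses the first coordinate u'' of the third view as a linear combination of (u, v, u'), in the spirit of the first transfer equation.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))
X -= X.mean(axis=0)                               # relative space coordinates
M, M1, M2 = (rng.normal(size=(2, 3)) for _ in range(3))
u, u1, u2 = X @ M.T, X @ M1.T, X @ M2.T           # relative image coordinates

# Fit u'' = alpha*u + beta*v + gamma*u' over all points.
B = np.column_stack([u[:, 0], u[:, 1], u1[:, 0]])
coef, *_ = np.linalg.lstsq(B, u2[:, 0], rcond=None)
pred = B @ coef
```

The fit is exact for generic cameras because the three rows used from the first two views span the space of linear forms in the space coordinates.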

Since each point correspondence gives 4 linearly independent 3-view 
constraints, 4 points give (4 - 1) x 4 = 12 linear equations for solving these minors. 

The appearance of common minors between 2-view and 3-view constraints is 
not accidental: we have 3 + 4 = 7 constraints, each with 4 coefficients, 
which amounts to 4 x 7 = 28 coefficients. As there are only 20 minors, at least 
8 of these coefficients must repeat a minor that already appears elsewhere. 



4 Euclidean Motion/Structure from 3 Calibrated Affine Views 

So far, the linear estimates of the 2-view and 3-view constraints yield, directly 
but implicitly, the affine motion/structure. To get Euclidean or more exactly 
similarity motion/structure, we need at least 3 calibrated affine images. Here 
we use the unified formulation introduced in US! for calibrated affine cameras, 
thus the method developed will be valid for all calibrated orthographic, weak- 
perspective and para-perspective models. 

Each projection matrix $M_{2\times3}$ can be decomposed into M = sKR, where 
s is a scaling factor of the whole image, K is the intrinsic parameter matrix 
(for instance, $K = \begin{pmatrix} 1 & 0 \\ 0 & \xi \end{pmatrix}$, where $\xi$ is the aspect ratio for the weak-perspective 
case), and $R_{2\times3}$ represents 2 rows of a 3D rotation matrix. As we are assuming 
calibrated cameras, the intrinsic parameter matrix K is known and its inverse can 
be directly applied to the image points so that its effect is removed completely. 
So the projection matrix M for normalised image coordinates becomes M = sR, 
i.e. a scaled 2 x 3 rotation matrix. There are in total 8 = 2 x 3 + 2 Euclidean 
parameters for a set of 3 views: two relative 3D rotations R and G each having 
3 d.o.f. and 2 relative scale factors s and s' between (say) the first view and the 
remaining ones. 

Any linear algorithm will consist of first estimating linearly the minors using 
multiple view constraints, then extracting the 8 Euclidean parameters from these 
minors by identifying the projection matrices M' and M'' with the scaled rotation 
matrices sR and s'G. 

Combining 2-view and 3-view constraints. We should keep in mind that although 
all 7 constraints are linearly independent, only 3 = (6 - 3) x (4 - 3) of them are 
algebraically independent due to Grassmannian relations. There are in total 20 
homogeneous coefficients, the minors of the joint projection matrix. How to choose 
the most appropriate constraints is of primary importance. The selected 
constraints should be algebraically independent and contain as few coefficients as 
possible. 

Taking only the three 2-view constraints is a poor choice since the third 
one is partially dependent on the first two by the composition rule on the rigid 
motions and each one is completely separate from the others. Taking only the 
3-view constraints leads to a complicated algebraic manipulation for the extraction 
of Euclidean parameters. Hence we combine 2-view and 3-view constraints. 

The key observation is that there are common coefficients between 2-view and 
3-view constraints: two of the three 2-view constraints share 4 minors $t_9$, $t_{10}$, $t_{11}$ 
and $t_{12}$ with the 3-view constraints. This allows us to use the following 
combination: two 2-view constraints plus one of the 3-view constraints. 

These 10 unknown minors can be solved as a single homogeneous vector under 
the constraint $\|\mathbf{t}_{10}\| = 1$, where $\mathbf{t}_{10}$ denotes the vector of the 10 minors: 

\[ t_4 u + t_8 v + t_{11} u' + t_9 u'' = 0, \]
\[ t_{13} u + t_{14} v + t_{10} u' + t_9 v' = 0, \]
\[ t_{19} u + t_{20} v + t_{12} u'' + t_{11} v'' = 0. \]

Any ratio $t_i/t_j$ of the minors is therefore obtained. 

Obtaining partial solutions from 2-view constraints. First, from the estimated 
minors of the 2-view constraint, we can easily obtain the partial Euclidean 
solution based on Shapiro et al.'s reformulation of Koenderink and van Doorn's 
rotation representation. Koenderink and van Doorn's representation is probably 
the most appropriate parameterisation, since it distinguishes clearly between the 
entities which can be obtained from two views and those that cannot. 

Assume that the affine epipolar geometry of two views is estimated as 

\[ (a, b, c, d)(u, v, u', v')^T = 0. \]

From the 3 ratios a : b : c : d, exactly 3 Euclidean parameters can be extracted. 
Using a scaled rotation matrix instead of M, the following relation holds: 

\[ a : b : c : d = s r_{32} : -s r_{31} : r_{23} : -r_{13}. \]

Therefore, the relative scale factor between the two views is immediately 
given as 

\[ s = \sqrt{\frac{a^2 + b^2}{c^2 + d^2}}. \]

According to Koenderink and van Doorn, the entire rotation can be 
decomposed into a rotation in the image plane (assume this rotation angle to be $\theta$) 
and a rotation through an angle $\rho$ about an axis (angled at $\phi$ to the positive x 
axis) in a frontoparallel plane. The rotation matrix in terms of Koenderink and 
van Doorn's $\theta$-$\phi$-$\rho$ representation [7] can be recomposed as $R_{3\times3}$, writing 
c(.) for cos(.) and s(.) for sin(.): 

\[ R_{3\times3} = \begin{pmatrix}
(1 - c(\rho))\,c(\phi)\,c(\phi-\theta) + c(\rho)\,c(\theta) & (1 - c(\rho))\,c(\phi)\,s(\phi-\theta) - c(\rho)\,s(\theta) & s(\phi)\,s(\rho) \\
(1 - c(\rho))\,s(\phi)\,c(\phi-\theta) + c(\rho)\,s(\theta) & (1 - c(\rho))\,s(\phi)\,s(\phi-\theta) + c(\rho)\,c(\theta) & -c(\phi)\,s(\rho) \\
-s(\rho)\,s(\phi-\theta) & s(\rho)\,c(\phi-\theta) & c(\rho)
\end{pmatrix}. \]

Therefore, 

\[ a : b : c : d = s \sin\rho \cos(\phi - \theta) : s \sin\rho \sin(\phi - \theta) : -\cos\phi \sin\rho : -\sin\phi \sin\rho, \]

hence the direction of the rotation axis is determined modulo $\pi$ by $\tan\phi = d/c$, 
and the rotation angle in the image plane then follows from $\tan(\phi - \theta) = b/a$. 

Obtaining the full solution with the 3-view constraint. Up to this point, the only 
unknown is the rotation angle out of the image plane $\rho$, which is the only 
component that generates depth information. The one-parameter family of solutions 
for the rotation matrix between the two views is 

\[ R_{3\times3}(\rho) = \begin{pmatrix} D + E\cos\rho & F\sin\rho \\ G\sin\rho & \cos\rho \end{pmatrix}, \]

where $D_{2\times2}$, $E_{2\times2}$, $F_{2\times1}$ and $G_{1\times2}$ are the known quantities. 

Similarly, with the second 2-view constraint, we get another one-parameter 
family of solutions for the rotation matrix $G_{3\times3}(\rho')$ of the other 2 views in terms 
of the unknown rotation angle out of the image plane $\rho'$. 

Now, it is time to use the 3-view constraint to fully determine the 
motion/structure. It can be easily verified that 

\[ t_4 : t_8 : t_{11} : t_9 = ss'(r_{11}g_{13} - g_{11}r_{13}) : ss'(r_{12}g_{13} - g_{12}r_{13}) : -s'g_{13} : s\,r_{13}. \]

Substituting the ratio $t_9/t_{11}$ into $t_4/t_{11}$ and $t_8/t_{11}$, we get exactly 2 linear 
equations in $\cos\rho$ and $\cos\rho'$, 

\[ t_4/t_{11} = a\cos\rho + b\cos\rho' + c, \]
\[ t_8/t_{11} = a'\cos\rho + b'\cos\rho' + c', \]

i.e. 

\[ A_{2\times2} \begin{pmatrix} \cos\rho \\ \cos\rho' \end{pmatrix} = B_{2\times1}. \]
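
The two-view stage of this derivation (recovering s, $\theta$ and $\phi$ from the ratios a : b : c : d) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the first view is assumed to be the reference view with M = [I | 0], and the numerical values are borrowed from the experiment reported later purely as test inputs.

```python
import numpy as np

def kvd_rotation(theta, phi, rho):
    """Rotation by rho about the axis (cos phi, sin phi, 0) composed with an
    image-plane rotation by theta, in the theta-phi-rho representation."""
    c, s = np.cos, np.sin
    return np.array([
        [(1 - c(rho)) * c(phi) * c(phi - theta) + c(rho) * c(theta),
         (1 - c(rho)) * c(phi) * s(phi - theta) - c(rho) * s(theta),
         s(phi) * s(rho)],
        [(1 - c(rho)) * s(phi) * c(phi - theta) + c(rho) * s(theta),
         (1 - c(rho)) * s(phi) * s(phi - theta) + c(rho) * c(theta),
         -c(phi) * s(rho)],
        [-s(rho) * s(phi - theta), s(rho) * c(phi - theta), c(rho)],
    ])

def two_view_parameters(a, b, c, d):
    """Scale, image-plane angle theta and axis angle phi (modulo pi)
    from the affine epipolar coefficients a : b : c : d."""
    scale = np.sqrt((a * a + b * b) / (c * c + d * d))
    phi = np.arctan2(-d, -c)          # modulo pi: the sign of sin(rho) is unknown
    theta = phi - np.arctan2(b, a)    # the pi ambiguity cancels in the difference
    return scale, theta, phi

s_true, theta_true, phi_true, rho_true = 1.06, -0.33, -1.51, 0.33
R = kvd_rotation(theta_true, phi_true, rho_true)
# a : b : c : d = s*r32 : -s*r31 : r23 : -r13 (0-based indexing below).
a, b, c, d = s_true * R[2, 1], -s_true * R[2, 0], R[1, 2], -R[0, 2]
s_est, theta_est, phi_est = two_view_parameters(a, b, c, d)
```

Using `arctan2` rather than the plain tangent keeps the common sign of the four coefficients consistent, so that $\theta$ is recovered without a $\pi$ ambiguity while $\phi$ remains defined modulo $\pi$.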



Remarks 

- This new formulation contains only 9 = 10 - 1 independent parameters; 
compared with the 8 Euclidean motion parameters of the set of 3 views, it 
is a minimal linear parameterisation. 

- The advantage of solving t as a whole is that it provides more information. 
For instance, the knowledge of the ratio $t_9 : t_{11}$, which could not be recovered 
from 2-view constraints, is the key for a linear solution in $\cos\rho$ and $\cos\rho'$. 
Huang and Lee had to re-compute this ratio at the very beginning of the 
second step with the rotation composition constraint. 

- Solving directly and linearly for $\cos\rho$ and $\cos\rho'$ is of great significance. On 
the one hand, the intrinsic two-way ambiguity is nicely expressed by the fact 
that 

$\cos(-\rho) = \cos\rho$. 

This parameterisation indeed makes a linear solution possible since the two 
equations are solved together. On the other hand, the only failure mode of 
the entire linear algorithm is the possibility that $\cos\rho > 1$, which may happen 
due to numerical error when $\cos\rho$ is close to 1. Since 

$\cos\rho \approx 1 \iff \rho \approx 0$, 

this means that actually there is almost no rotation out of the image plane. 
As rotation out of the image plane is the only component which contains 
depth information, this means that the 2D images we used do not contain 
the desired 3D structure, or equivalently we can report that $\rho = 0$ for the 
motion. Essentially, this algorithm does not have the failure mode that the 
factorisation method suffers in its linear version. 



4.1 Comparison with Related Work 

We first make a comparison with the existing linear algorithms of Huang and 
Lee and the factorisation method of Tomasi and Kanade. 

Basically, there are two steps in Huang and Lee's method [6]. The first step 
computes the coefficients of the three 2-view constraints. Any 3-view constraint 
was totally absent during the batch solution step and was introduced only in the 
second step by the composition rule of rotation matrices $R_{13} = R_{23}R_{12}$. In our new 
linear algorithm, the 3-view constraint is already integrated in the first 
numerical step, and no other constraint is used afterwards. Since Euclidean 
depth information is only contained in 3-view constraints, it is 
dangerous not to use any 3-view constraint during the numerical step. Although three 
2-view constraints also have 9 = 3 x 3 independent parameters if each one is 
estimated individually, they are essentially a set of 12 homogeneous 
parameters which breaks up into 3 sets of 4 homogeneous parameters. If we examine 
carefully the second step of Huang and Lee's method, it has to recompute this 
set of 12 homogeneous parameters (via the ratios between these 3 sets of 4 
homogeneous parameters) using the 3-view constraint. In conclusion, Huang and 
Lee's method does not use the appropriate constraints, which inevitably leads to 
inconsistencies in the rotation matrix. 

The factorisation method of Tomasi and Kanade is most suitable for 
redundant views; its major problem is to impose the 'metric 
constraint'. When the linearly estimated matrix, which is the product of an affine 
transformation and its transpose, is not positive definite, the whole Euclidean 
reconstruction process fails. The exact Cholesky parameterisation of the matrix 
needs to solve simultaneous quadratic equations. 

Other important work includes that of Koenderink and van Doorn [7]. The method 
consists of three steps. The first step shows that the scale change between 2 views, 
the rotation in the image plane around the viewing direction and the projection of 
the rotation axis out of the image plane can be obtained with 2 views of 4 points. 
The second step is to parameterise the remaining Euclidean structure with the 
angle of the rotation out of the image plane and the 2 depths of the reference 
triangle, then eliminate the unknown angle to get a quadratic equation on the 
2 Euclidean depths. Finally, with the third view, a second quadratic equation is 
obtained. Intersecting these two quadratics gives 4 possible solutions for the two 
depths. These intersections represent either one or two pairs of solutions that 
are related through a reflection in the fronto-parallel plane. 

We can see that the first step of Koenderink and van Doorn's method is 
similar to that of Huang and Lee [6] and Lee and Huang [8]: it uses the 2-view 
constraint to get partial solutions, although Koenderink and van Doorn's is more 
geometrically oriented. One major difference is that Koenderink and van Doorn 
do not use the third 2-view constraint as Huang and Lee did. Unfortunately, 
Koenderink's method needs to intersect two quadratics for each pair of points, 
and cannot handle all available points. 

Shapiro et al. extended Koenderink and van Doorn's first step by nicely 
relating Koenderink and van Doorn's rotation representation to the affine 
epipolar geometry. Unfortunately they failed to get a closed form solution and 
adopted a non-linear numerical optimisation approach for the 3-view case. 

Ullman and Basri, and Poggio [12], considered the 3-view constraint for 
linear combination for recognition. Although they essentially show the equivalence 
with motion/structure, they do not concentrate on motion/structure recovery; in 
fact there is no closed form solution which allows Euclidean structure to be extracted 
directly from linear combination coefficients. 

5 Unifying 3-View Geometry of Points and Lines 

Using only line segments for motion/structure within the affine camera 
framework has been addressed previously. We can now examine the possibility of 
combining points and line segments into the same framework. 



5.1 Trifocal Tensor of Lines 



Recall that there exists a linear mapping between directions of 3D lines x and 
those of 2D lines u: $\lambda u = M_{2\times3} x$. It can be derived even more directly using projective 
geometry, by considering that the line with direction $d_x$ is the point at infinity 
$x_\infty = (d_x^T, 0)^T$ in $\mathbb{P}^3$ and the line with direction $d_u$ is the point at infinity $u_\infty$ 
in $\mathbb{P}^2$. 

For three views, these mappings can be rewritten in matrix form as 

\[ \begin{pmatrix} M & u & 0 & 0 \\ M' & 0 & u' & 0 \\ M'' & 0 & 0 & u'' \end{pmatrix} \begin{pmatrix} x \\ -\lambda \\ -\lambda' \\ -\lambda'' \end{pmatrix} = 0, \]

which is the basic reconstruction equation for a one-dimensional camera. The 
vector $(x^T, -\lambda, -\lambda', -\lambda'')^T$ cannot be zero, and so 

\[ \begin{vmatrix} M & u & 0 & 0 \\ M' & 0 & u' & 0 \\ M'' & 0 & 0 & u'' \end{vmatrix} = 0. \tag{5} \]

The expansion of this determinant produces a trilinear constraint of three 
views 

\[ T_{ijk}\, u^i u'^j u''^k = 0, \tag{6} \]

where $T_{ijk}$ is a 2 x 2 x 2 homogeneous tensor whose components are 3 x 3 
minors of the following 6 x 3 joint projection matrix: 

\[ \begin{pmatrix} M \\ M' \\ M'' \end{pmatrix}. \]

The components of the tensor can be made explicit as 

\[ T_{ijk} = [\bar{i}\ \bar{j}'\ \bar{k}''], \quad \text{for } i, j, k = 1, 2, \tag{7} \]

where the bracket $[i\ j'\ k'']$ denotes the 3 x 3 minor of the i-th, j'-th and k''-th row 
vectors of the above joint projection matrix, and the bar in $\bar{i}$, $\bar{j}$ and $\bar{k}$ denotes 
the mapping 

\[ (1, 2) \mapsto (2, -1). \]
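
The relation between the trilinear constraint and the minors can be checked numerically. The sketch below, using randomly generated cameras and line directions (hypothetical data), estimates the 2 x 2 x 2 tensor linearly from 7 line correspondences and compares it, up to scale, with the signed minors given by the bar mapping. The flattening order of the tensor follows the index 4(i - 1) + 2(j - 1) + k.

```python
import numpy as np

def tensor_from_minors(M, M1, M2):
    """T[i, j, k] as signed 3x3 minors of the joint projection matrix.

    With 0-based indices, the bar mapping (1, 2) -> (2, -1) sends index 0 to
    row 1 with sign +1 and index 1 to row 0 with sign -1.
    """
    row_and_sign = {0: (1, 1.0), 1: (0, -1.0)}
    T = np.zeros((2, 2, 2))
    for i in range(2):
        for j in range(2):
            for k in range(2):
                (ri, si), (rj, sj), (rk, sk) = (row_and_sign[i], row_and_sign[j],
                                                row_and_sign[k])
                rows = np.vstack([M[ri], M1[rj], M2[rk]])
                T[i, j, k] = si * sj * sk * np.linalg.det(rows)
    return T

def fit_line_tensor(u, u1, u2):
    """Unit-norm solution of T_ijk u^i u'^j u''^k = 0 from (N, 2) direction arrays."""
    A = np.einsum('ni,nj,nk->nijk', u, u1, u2).reshape(len(u), 8)
    _, sing, vt = np.linalg.svd(A)
    return vt[-1].reshape(2, 2, 2), sing

rng = np.random.default_rng(5)
M, M1, M2 = (rng.normal(size=(2, 3)) for _ in range(3))
d = rng.normal(size=(7, 3))                        # 7 generic 3D line directions
scales = rng.uniform(0.5, 2.0, size=(3, 7, 1))     # unknown scales of image directions
u, u1, u2 = scales[0] * (d @ M.T), scales[1] * (d @ M1.T), scales[2] * (d @ M2.T)

T_est, sing = fit_line_tensor(u, u1, u2)
T_ref = tensor_from_minors(M, M1, M2)
```

Each line contributes one equation in the 8 homogeneous tensor components, which is why 7 lines form a minimal configuration.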

5.2 Trifocal Tensor of Points 

Let us return to the 3-view constraints for points, 

\[ t_4 u + t_8 v + t_{11} u' + t_9 u'' = 0, \]
\[ t_2 u + t_6 v + t_{11} v' + t_{10} u'' = 0, \]
\[ t_3 u + t_7 v + t_{12} u' + t_9 v'' = 0, \]
\[ t_1 u + t_5 v + t_{12} v' + t_{10} v'' = 0. \]

By eliminating $t_9$ and $t_{10}$, we get two bilinear forms, and further elimination 
between the bilinear forms gives the following trilinear form: 

\[ T_{ijk}\, u^i u'^j u''^k = 0, \]

with the tensor-vector mapping defined by $T_{ijk} = t_{4(i-1)+2(j-1)+k}$. 

5.3 Method 1 — Using the Trifocal Tensor 

As both points and lines share exactly the same trifocal tensor $T_{ijk}$, an immediate 
integration method consists of estimating this common tensor with both points 
and lines: 

\[ T_{ijk}\, u^i u'^j u''^k = 0. \]

As a matter of fact, this is not surprising at all, since each point in relative 
coordinates may be considered as a line segment. Moreover, a set of n points is 
equivalent to n - 1 lines. That is, points are just normal points and line directions are 
properly "scaled" points. 

One major drawback of this method is that we are not using the information 
on the known scales of the "points". 

5.4 Method 2 — Using 3- View Constraints and Trifocal Tensor 

Another method of integrating points and lines is to look for the common minors 
among multiple view constraints and the trifocal tensor. 

As we have seen, the trifocal tensor shares all its 8 minors with the 
3-view constraints (it is reasonable that the tensor has nothing in common with the 
2-view constraints) via 

\[ T_{ijk} = t_{4(i-1)+2(j-1)+k}. \]

We can therefore solve the following homogeneous linear system of equations: 

\[ t_4 u + t_8 v + t_{11} u' + t_9 u'' = 0, \]
\[ t_2 u + t_6 v + t_{11} v' + t_{10} u'' = 0, \]
\[ t_3 u + t_7 v + t_{12} u' + t_9 v'' = 0, \]
\[ t_1 u + t_5 v + t_{12} v' + t_{10} v'' = 0, \]
\[ T_{ijk}\, u^i u'^j u''^k = 0, \]

i.e. 

\[ A\,(t_1, t_2, \ldots, t_{12})^T = 0. \]

5.5 Some Minimal Configurations 

By suitable switching between the two methods, we can establish the following 
minimal geometric configurations for points and lines: 

- 4 points + 0 lines 

- 3 points + 3 lines (2 x 4 + 3) 

- 2 points + 6 lines (1 + 6) 

- 1 point + 7 lines (0 + 7) 

- 0 points + 7 lines 

Generally we may proceed as follows: 

- when lines outnumber points, estimate the trifocal tensor both for points 
and lines (method 1); 

- when points outnumber lines, estimate the 12 components (method 2). 

6 Experimental Results 

The linear method for motion/structure from 3 calibrated affine images devel- 
oped in this paper has been implemented and applied to real image sequences. 

We first acquired a sequence of images of a calibration pattern with a 
standard camera mounted on a robot. It is important to stress that the imaging 
conditions were not chosen to be close to affine. The triplet of images we used is 
shown in Figure 1. The 69 points have been automatically identified and tracked 
for the triplet. 




Fig. 1. The triplet of images of the calibration pattern. 



The 8 Euclidean motion parameters are estimated by the linear method as 
$s = 1.06$, $\phi = -1.51$, $\theta = -0.33$, $\rho = 0.33$ and $s' = 1.11$, $\phi' = -1.53$, $\theta' = -0.47$, 
$\rho' = 0.43$. The resulting shape reconstruction from these motion parameters is 
shown in Figure 2. 

To evaluate the reconstruction quality, we did the same 3D reconstruction 
using an existing method based on a full perspective camera model. 
The two reconstructions differ by a 3D similarity transformation 
which can be easily estimated. The normalised relative error of the Euclidean 
reconstruction with the affine camera with respect to the perspective one is 3.6 
percent. 




Fig. 2. Two views of the resulting 3D reconstruction. 




Fig. 3. Two different views of the two superimposed 3D reconstructions, one 
uses weak-perspective camera model (marked by a square) and the other full 
perspective (marked by a circle). 



We also tried our method on the popular hotel sequence kindly provided by 
Poelman and Kanade at CMU. In this sequence, the camera motion included 
substantial translation away from the camera and across the field of view. 197 
points throughout the sequence of 181 images are automatically identified and 
tracked. For a more detailed description of this set-up, consult [11]. The triplet 
of images we used is displayed in Figure 4. The resulting 3D reconstruction is 
shown in Figure 5. 

Fig. 4. The triplet of hotel image sequence. 




Fig. 5. Two views of the 3D reconstruction of the hotel. 



7 Conclusion 

We have introduced a unified approach to 2-view and 3-view geometric con- 
straints of affine images, and a new linear method for Euclidean motion/structure 
from 3 calibrated affine images. What is also important for the framework of mul- 
tiple affine views developed in this paper is that line segment features could also 
be incorporated. The method has been validated on real image sequences. 



Acknowledgements 



This work is in part supported by CUMULI. 







References 

1. J.Y. Aloimonos. Perspective approximations. Image and Vision Computing, 
8(3):179-192, August 1990. 

2. B.M. Bennet, D.D. Hoffman, J.E. Nicola, and C. Prakash. Structure from two 
orthographic views of rigid motion. Journal of the Optical Society of America, 
6(7):1052-1069, 1989. 

3. S. Christy and R. Horaud. Euclidean shape and motion from multiple perspective 
views by affine iterations. IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 18(11):1098-1104, November 1996. 

4. O. Faugeras and B. Mourrain. On the geometry and algebra of the point and 
line correspondences between n images. In Proceedings of the 5th International 
Conference on Computer Vision, Cambridge, Massachusetts, USA, pages 951-956, 
June 1995. 

5. R.I. Hartley. Lines and points in three views and the trifocal tensor. International 
Journal of Computer Vision, 22(2):125-140, 1997. 

6. T.S. Huang and C.H. Lee. Motion and structure from orthographic projections. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5):536-540, 
1989. 

7. J. Koenderink and A. van Doorn. Affine structure from motion. Journal of the 
Optical Society of America A, 8(2):377-385, 1991. 

8. C.H. Lee and T. Huang. Finding point correspondences and determining motion 
of a rigid object from two weak perspective views. Computer Vision, Graphics and 
Image Processing, 52:309-327, 1990. 

9. Y. Mukaigawa, Y. Nakamura, and Y. Ohta. Synthesis of arbitrarily oriented face 
views from two images. In Proceedings of the Second Asian Conference on
Computer Vision, pages 718-722, 1995. 

10. J.L. Mundy and A. Zisserman, editors. Geometric Invariance in Computer Vision. 
The MIT Press, Cambridge, MA, USA, 1992. 

11. C.J. Poelman and T. Kanade. A paraperspective factorization method for shape 
and motion recovery. In J.O. Eklundh, editor, Proceedings of the 3rd European 
Conference on Computer Vision, Stockholm, Sweden, pages 97-108.
Springer-Verlag, May 1994. 

12. T. Poggio and S. Edelman. A network that learns to recognize three-dimensional 
objects. Nature, 343(6255):263-266, January 1990. 

13. L. Quan. Self-calibration of an affine camera from multiple views. International 
Journal of Computer Vision, 19(1):93-105, May 1996. 

14. L. Quan. Uncalibrated 1D projective camera and 3D affine reconstruction of lines. 
In Proceedings of the Conference on Computer Vision and Pattern Recognition, 
Puerto Rico, USA, pages 60-65, June 1997. 

15. L. Quan. Algebraic relations among matching constraints of multiple images.
Technical report, INRIA, January 1998. Also TR Lifia-Imag 1995. 

16. L. Quan and T. Kanade. Affine structure from line correspondences with
uncalibrated affine cameras. IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 19(8):834-845, August 1997. 

17. L. Quan and Y. Ohta. A new linear method for euclidean motion/structure from 
three calibrated affine views. In Proceedings of the Conference on Computer Vision 
and Pattern Recognition, Santa Barbara, California, USA, June 1998. 

18. L.S. Shapiro, A. Zisserman, and M. Brady. 3D motion recovery via affine epipolar 
geometry. International Journal of Computer Vision, 16(2):147-182, 1995. 







19. A. Shashua. Algebraic functions for recognition. IEEE Transactions on Pattern 
Analysis and Machine Intelligence, 17(8):779-789, August 1995. 

20. C. Tomasi and T. Kanade. Shape and motion from image streams under
orthography: A factorization method. International Journal of Computer Vision, 
9(2):137-154, November 1992. 

21. B. Triggs. Matching constraints and the joint image. In E. Grimson, editor, 
Proceedings of the 5th International Conference on Computer Vision, Cambridge, 
Massachusetts, USA, pages 338-343. IEEE, IEEE Computer Society Press, June 
1995. 

22. S. Ullman. The Interpretation of Visual Motion. The MIT Press, Cambridge, MA, 
USA, 1979. 

23. S. Ullman and R. Basri. Recognition by linear combinations of models. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 13(10):992-1006, 1991. 

24. D. Weinshall and C. Tomasi. Linear and incremental acquisition of invariant shape 
models from image sequences. In Proceedings of the 4th International Conference 
on Computer Vision, Berlin, Germany, pages 675-682. IEEE, May 1993. 

25. Z. Zhang, K. Isono, and S. Akamatsu. Euclidean structure from uncalibrated 
images using fuzzy domain knowledge: Application to facial images synthesis. In 
Proceedings of the 6th International Conference on Computer Vision, Bombay, 
India, 1998. To appear. 




Tensor Embedding of the Fundamental Matrix 



Shai Avidan and Amnon Shashua 

Institute of Computer Science, The Hebrew University, 
91904 Jerusalem, Israel. 

{avidan,shashua}@cs.huji.ac.il 



Abstract. We revisit the bilinear matching constraint between two per- 
spective views of a 3D scene. Our objective is to represent the con- 
straint in the same manner and form as the trilinear constraint among 
three views. The motivation is to establish a common terminology that 
bridges between the fundamental matrix F (associated with the bilinear 
constraint) and the trifocal tensor $\mathcal{T}_i^{jk}$ (associated with the
trilinearities). By achieving this goal we can unify both the properties and the 
techniques introduced in the past for working with multiple views for 
geometric applications. 

In doing so we introduce a $3 \times 3 \times 3$ tensor $\mathcal{F}_i^{jk}$, which we call the bifocal 
tensor, that represents the bilinear constraint. The bifocal and trifocal 
tensors share the same form and the same contraction properties. 
By close inspection of the contractions of the bifocal tensor into matrices 
we show that one can represent the family of rank-2 homography matrices 
by $[\delta]_\times F$ where $\delta$ is a free vector. We then discuss four applications of 
the new representation: (i) Quasi-metric viewing of projective data, (ii) 
triangulation, (iii) view synthesis, and (iv) recovery of camera ego-motion 
from a stream of views. 



1 Introduction 

The geometry of multiple views is governed by certain multi-linear constraints, 
bilinear for pairs of views and trilinear for triplets of views — all other multi- 
linear constraints (four views and beyond) are spanned by the bilinear and tri- 
linear constraints. 

The traditional representation of the coefficients of the bilinear constraint is 
by a $3 \times 3$ matrix, $F$, that satisfies $p'^\top F p = 0$ for all matching image points $p, p'$ 
(represented in the 2D projective space) across two views. On the other hand, 
the three-view relations are represented by a set of 4 trilinear constraints, each 
of the form $p^i s_j r_k \mathcal{T}_i^{jk} = 0$ where $s$ and $r$ are lines coincident with the matching 
points $p'$ and $p''$, respectively. In other words, the bilinear constraint represents 
a “point+point” relation, whereas each of the trilinear constraints represents a 
“point+line+line” relation. 

Because of the difference in form between the fundamental matrix and the 
trifocal tensor, the analysis tools are different and the properties discovered 
for one do not easily carry over to the other. For example, the trifocal tensor 
contracts (reduces) to matrix forms that carry geometric information: one type 
of contraction produces subgroups of 2D homography matrices and another type 
of contraction produces a subgroup of 2D correlation matrices. There is no such 
equivalence known for the fundamental matrix, for instance. 

In this paper we revisit the bilinear constraint and represent it using a $3 \times 3 \times 3$ 
tensor $\mathcal{F}_i^{jk}$: $p^i s_j r_k \mathcal{F}_i^{jk} = 0$, where $s, r$ are two lines coincident with the matching point 
$p'$. We call the tensor $\mathcal{F}_i^{jk}$ the “bifocal” tensor and show that not only does it share 
the same form as the trifocal tensor but it also shares the same properties. We 
can therefore consider contractions of the bifocal tensor just as is done with 
the trifocal counterpart. 

Through the inspection of tensor contractions we derive the representation 
of the subgroup of rank-2 homography matrices in the simple form of $[\delta]_\times F$ 
where $\delta$ is a free vector. We introduce the group of “primitive homographies” 
and discuss 4 applications of the new representation: (i) Quasi-metric viewing 
of projective data, (ii) triangulation, (iii) view synthesis, and (iv) recovery of 
camera ego-motion from a stream of views. This work in its initial form was 
presented at the meeting found in [?]. 

2 Notations 

A point $x$ in the 3D projective space is projected onto the point $p$ in the 2D 
projective space $\mathcal{P}^2$ by a $3 \times 4$ camera projection matrix $\mathbf{A} = [A, v']$ that satisfies 
$p \cong \mathbf{A}x$, where $\cong$ represents equality up to scale. The left $3 \times 3$ minor of $\mathbf{A}$, 
denoted by $A$, stands for a 2D projective transformation of some arbitrary plane 
(the reference plane) and the fourth column of $\mathbf{A}$, denoted by $v'$, stands for the 
epipole (the projection of the center of camera 1 on the image plane of camera 
2). In a calibrated setting the 2D projective transformation is the rotational 
component of camera motion (the reference plane is at infinity) and the epipole 
is the translational component of camera motion. Since only relative camera 
positioning can be recovered from image measurements, the camera matrix of 
the first camera position in a sequence of positions can be represented by $[I; 0]$. 

We will occasionally use tensorial notations, which are briefly described next. 
We use the covariant-contravariant summation convention: a point is an object 
whose coordinates are specified with superscripts, i.e., $p^i = (p^1, p^2, \ldots)$. These are 
called contravariant vectors. An element in the dual space (representing
hyper-planes, e.g., lines in $\mathcal{P}^2$) is called a covariant vector and is represented by 
subscripts, i.e., $s_j = (s_1, s_2, \ldots)$. Indices repeated in covariant and contravariant 
forms are summed over, i.e., $p^i s_i = p^1 s_1 + p^2 s_2 + \ldots + p^n s_n$. This is known 
as a contraction. For example, if $p$ is a point incident to a line $s$ in $\mathcal{P}^2$, then 
$p^i s_i = 0$. Vectors are also called 1-valence tensors. 2-valence tensors (matrices) 
have two indices and the transformation they represent depends on the
covariant-contravariant positioning of the indices. For example, $a_i^j$ is a mapping from points 
to points, and hyper-planes to hyper-planes, because $a_i^j p^i = q^j$ and $a_i^j s_j = r_i$ 
(in matrix form: $Ap = q$ and $A^\top s = r$); $a_{ij}$ maps points to hyper-planes; and 
$a^{ij}$ maps hyper-planes to points. When viewed as a matrix the row and column 
positions are determined accordingly: in $a_i^j$ and $a_{ji}$ the index $i$ runs over the 
columns and $j$ runs over the rows, thus $b_j^k a_i^j = c_i^k$ is $BA = C$ in matrix form. 
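The index conventions above can be tried out numerically. The following is a minimal sketch, not part of the original paper: the function names, the nested-list storage (`a[j][i]` holds $a_i^j$) and the numeric values are all assumptions made for illustration.

```python
# Storage convention (an assumption of this sketch): a[j][i] holds a_i^j,
# so the upper index j selects the row and the lower index i the column.

def map_point(a, p):
    """q^j = a_i^j p^i  (matrix form: q = A p)."""
    return [sum(a[j][i] * p[i] for i in range(3)) for j in range(3)]

def map_hyperplane(a, s):
    """r_i = a_i^j s_j  (matrix form: r = A^T s)."""
    return [sum(a[j][i] * s[j] for j in range(3)) for i in range(3)]

def compose(b, a):
    """c_i^k = b_j^k a_i^j  (matrix form: C = B A)."""
    return [[sum(b[k][j] * a[j][i] for j in range(3)) for i in range(3)]
            for k in range(3)]

def contract(p, s):
    """Scalar p^i s_i; it vanishes when the point p lies on the line s."""
    return sum(p[i] * s[i] for i in range(3))

A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]   # hypothetical values
B = [[1, 1, 0], [0, 2, 1], [1, 0, 1]]   # hypothetical values
p = [3, -1, 1]
s = [1, 2, -4]

q = map_point(A, p)        # the point A p
r = map_hyperplane(A, s)   # the line A^T s
C = compose(B, A)          # the matrix B A
```

Note that `contract(q, s)` and `contract(p, r)` are the same triple contraction $a_i^j p^i s_j$ grouped in two ways, which is why both functions can be read off the single object $a_i^j$.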

An outer-product of two 1-valence tensors (vectors), $a_i b^j$, is a 2-valence tensor 
$c_i^j$ whose $i,j$ entries are $a_i b^j$; note that in matrix form $C = b a^\top$. An $n$-valence 
tensor described as an outer-product of $n$ vectors is a rank-1 tensor. Any
$n$-valence tensor can be described as a sum of rank-1 $n$-valence tensors. The rank 
of an $n$-valence tensor is the smallest number of rank-1 $n$-valence tensors with 
sum equal to the tensor. For example, a rank-1 trivalent tensor is $a_i b_j c_k$ where 
$a_i$, $b_j$ and $c_k$ are three vectors. The rank of a trivalent tensor $\alpha_{ijk}$ is the smallest 
$r$ such that, 

$$\alpha_{ijk} = \sum_{s=1}^{r} a_{is} b_{js} c_{ks}. \qquad (1)$$

We will make extensive use of the “cross-product tensor” $\epsilon$ defined next. The 
cross product (vector product) operation $c = a \times b$ is defined for vectors in $\mathcal{P}^2$. 
The vector $c$ is the line joining the points $a, b$, or the point of intersection of the 
lines $a, b$. The product operation can also be represented as the product $c = [a]_\times b$ 
where $[a]_\times$ is called the “skew-symmetric matrix of $a$” and has the form: 

$$[a]_\times = \begin{pmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{pmatrix}$$

In tensor form we have $\epsilon_{ijk} a^i b^j = c_k$ representing the cross product of two 
points (contravariant vectors) resulting in the line (covariant vector) $c_k$.
Similarly, $\epsilon^{ijk} a_i b_j = c^k$ represents the point of intersection of the two lines $a_i$ and $b_j$. 
The tensor $\epsilon$ is defined such that $\epsilon_{ijk} a^i$ produces the matrix $[a]_\times$ (i.e., $\epsilon$ contains 
$0, -1, 1$ in its entries such that its operation on a single vector produces the 
skew-symmetric matrix of that vector). 
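As an illustration of the cross-product tensor, the sketch below builds $\epsilon$ explicitly and recovers both the cross product and the skew-symmetric matrix by contraction. The function names and the zero-based nested-list storage are assumptions of this example, not notation from the paper.

```python
def levi_civita():
    """Return the 3x3x3 cross-product tensor with entries in {0, 1, -1}."""
    eps = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        eps[i][j][k] = 1     # even permutations of (0, 1, 2)
        eps[i][k][j] = -1    # odd permutations of (0, 1, 2)
    return eps

def cross_by_contraction(eps, a, b):
    """c_k = eps_ijk a^i b^j, summed over i and j."""
    return [sum(eps[i][j][k] * a[i] * b[j] for i in range(3) for j in range(3))
            for k in range(3)]

def skew_by_contraction(eps, a):
    """M[k][j] = eps_ijk a^i, so that M applied to b gives a x b."""
    return [[sum(eps[i][j][k] * a[i] for i in range(3)) for j in range(3)]
            for k in range(3)]

eps = levi_civita()
a, b = [1, 2, 3], [4, 5, 6]            # two points in homogeneous coordinates
c = cross_by_contraction(eps, a, b)    # the line joining a and b
M = skew_by_contraction(eps, a)        # the matrix [a]_x
```

Since `c` is the line through `a` and `b`, contracting it with either point gives zero, which is the incidence relation $p^i s_i = 0$ from the notation section.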

3 Tensor Embedding of the Fundamental Matrix 

Our goal is to derive a trivalent tensor representation (i.e., a $3 \times 3 \times 3$ tensor) 
of the $3 \times 3$ fundamental matrix and to illuminate the advantages of doing so. 
In particular, once we have the trivalent tensor representation in hand we 
wish to investigate its contraction properties (as was done for the trifocal tensor 
in [?]) and recast them back in matrix form. 

We start with deriving the fundamental matrix from basic principles. Let $A$ 
be a 2D homography (collineation) from image 1 to image 2 due to some plane 
$\pi$, i.e., if $p$ is a point in image 1, then $Ap$ is a point coincident with the epipolar 
line $p' \times v'$ in image 2, where the exact location of $Ap$ on the epipolar line is 
determined by the position of the plane $\pi$. Thus, $(v' \times p')^\top Ap = 0$, or in tensor 
notation, 

$$0 = \epsilon_{lj\rho}\, p'^j v'^\rho\, p^i a_i^l = p^i p'^j \left( \epsilon_{lj\rho}\, v'^\rho a_i^l \right) = p^i p'^j F_{ji},$$

where $\epsilon_{lj\rho}\, p'^j v'^\rho$ is the cross-product $p' \times v'$. The matrix $F_{ji} = \epsilon_{lj\rho}\, v'^\rho a_i^l$ is the 
fundamental matrix that satisfies the bilinear constraint $p^i p'^j F_{ji} = 0$ (cf. [?]). 
In matrix form, since $\epsilon_{lj\rho}\, v'^\rho$ is the skew-symmetric matrix $[v']_\times$, then
$F = [v']_\times A$. 
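The relation $F = [v']_\times A$ can be checked on a small synthetic example. This is a minimal sketch under stated assumptions: the homography `A`, the epipole `v2` (standing for $v'$) and the point `p` are made-up integer values, and matching points are generated as $Ap + \rho v'$ for a few offsets $\rho$.

```python
def skew(a):
    """Skew-symmetric matrix [a]_x, so that skew(a) applied to b equals a x b."""
    return [[0, -a[2], a[1]],
            [a[2], 0, -a[0]],
            [-a[1], a[0], 0]]

def matmul(X, Y):
    return [[sum(X[r][m] * Y[m][c] for m in range(3)) for c in range(3)]
            for r in range(3)]

def matvec(M, x):
    return [sum(M[r][c] * x[c] for c in range(3)) for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]   # hypothetical homography of a plane
v2 = [1, 2, 1]                          # hypothetical epipole v' in view 2
F = matmul(skew(v2), A)                 # F = [v']_x A

p = [3, -1, 1]                          # a point in view 1
Ap = matvec(A, p)
# Matching points for several offsets rho from the reference plane.
matches = [[Ap[m] + rho * v2[m] for m in range(3)] for rho in (0, 1, 5, -2)]
residuals = [dot(p2, matvec(F, p)) for p2 in matches]                 # p'^T F p
left_null = [dot([F[r][c] for r in range(3)], v2) for c in range(3)]  # F^T v'
```

Every residual vanishes regardless of $\rho$, reflecting that the bilinear constraint only places $p'$ on the epipolar line, and `left_null` confirms that $v'$ is the left null vector of $F$.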

Next, we begin with the bilinear constraint $p^i p'^j F_{ji} = 0$ and consider
replacing the point $p'$ with a cross product of any two lines $s, r$ incident to it, i.e., 
$p'^l = \epsilon^{ljk} s_j r_k$. The reason for doing so will be apparent later on. We have
therefore a “point+line+line” relationship $p^i s_j r_k \mathcal{F}_i^{jk} = 0$ as follows: 

$$p^i p'^l F_{li} = p^i \left( \epsilon^{ljk} s_j r_k \right) F_{li} = p^i s_j r_k \left( \epsilon^{ljk} F_{li} \right) = p^i s_j r_k \mathcal{F}_i^{jk} = 0,$$

and the tensor $\mathcal{F}_i^{jk} = \epsilon^{ljk} F_{li}$ is a trivalent form of the fundamental matrix. This 
form is equivalent to considering the trifocal tensor of views 1,2,3 where views 
2,3 are identical. Thus we obtain a relationship between three views, but only 
two of the views are distinct. We can represent the “bifocal” tensor $\mathcal{F}_i^{jk}$ directly 
as a function of $v'$ and $A$ as follows: 

$$\mathcal{F}_i^{jk} = v'^j a_i^k - v'^k a_i^j.$$
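The embedding can be verified numerically. The sketch below builds the bifocal tensor in two ways: from the definition $\epsilon^{ljk} F_{li}$, and from the closed form $v'^j a_i^k - v'^k a_i^j$, which is our own expansion obtained by substituting $F = [v']_\times A$ and contracting the two $\epsilon$ tensors. All names and numeric values are hypothetical; `T[i][j][k]` stores the entry with lower index $i$ and upper indices $j, k$.

```python
def levi_civita():
    eps = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        eps[i][j][k] = 1
        eps[i][k][j] = -1
    return eps

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def bifocal_from_F(F):
    """T[i][j][k] = eps^{ljk} F_{li}, with F[l][i] holding F_{li}."""
    eps = levi_civita()
    return [[[sum(eps[l][j][k] * F[l][i] for l in range(3))
              for k in range(3)] for j in range(3)] for i in range(3)]

def bifocal_from_camera(A, v2):
    """T[i][j][k] = v'^j a_i^k - v'^k a_i^j, with A[k][i] holding a_i^k."""
    return [[[v2[j] * A[k][i] - v2[k] * A[j][i]
              for k in range(3)] for j in range(3)] for i in range(3)]

def point_line_line(T, p, s, r):
    """The scalar p^i s_j r_k T_i^{jk}."""
    return sum(p[i] * s[j] * r[k] * T[i][j][k]
               for i in range(3) for j in range(3) for k in range(3))

A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]        # hypothetical homography
v2 = [1, 2, 1]                               # hypothetical epipole v'
F = [[-1, -1, 2], [2, -1, 0], [-3, 3, -2]]   # F = [v']_x A for these values

T_eps = bifocal_from_F(F)
T_closed = bifocal_from_camera(A, v2)

p = [3, -1, 1]                 # a point in view 1
p2 = [8, 2, 1]                 # A p + 1 * v', a match of p in view 2
s = cross(p2, [1, 0, 0])       # two distinct lines through p'
r = cross(p2, [0, 1, 1])
residual = point_line_line(T_eps, p, s, r)
```

The closed form has the same shape as the usual expression of a trifocal tensor in terms of two camera matrices, with the second and third cameras set equal, which is one way to read the statement that views 2 and 3 coincide.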

The importance of the trivalent tensor embedding of the fundamental matrix 
(which we will call the bifocal tensor from now on) is that we have arrived at 
an equivalent representation with 3-view geometry: both the trifocal and bifocal 
tensors are $3 \times 3 \times 3$ and operate on a configuration of a point+line+line. In the 
case of three views, the lines are in two distinct views (the line $s$ coincides with 
$p'$ and the line $r$ coincides with $p''$) and there are 4 such relationships (due to 
the fact that there are two choices for each line). In the case of two views the 
two lines are in the same view and therefore there is only one configuration of 
point+line+line. 

The advantage of this equivalence in form between the trifocal and bifocal 
tensors appears when one considers contractions into bivalent forms (matrices). 
The properties of contractions of the trifocal tensor are well understood (see 
[?] and the appendix here) and provide the building blocks for making 
use of the trifocal tensor in applications. We can now apply an identical analysis 
to the bifocal tensor, which we will do next. 

3.1 Bifocal Tensor Contractions 

Given an arbitrary vector $\delta$, the trifocal tensor reduces to a matrix of three 
types: $\delta^i \mathcal{T}_i^{jk}$, $\delta_j \mathcal{T}_i^{jk}$ and $\delta_k \mathcal{T}_i^{jk}$. Note that when $\delta = (1,0,0)$, $(0,1,0)$ or $(0,0,1)$ 
we obtain “slices” of the tensor. The first type produces a rank-2 correlation 
matrix, i.e., a mapping from all 2D lines to collinear points (where the
orientation of collinearity is determined by $\delta$); by slicing the tensor in that way 
we obtain the three matrices of “line geometry” introduced in the calibrated 
context by [?]. The second and third types produce homography matrices 
(collineations). The second type is a homography matrix from view 1 to 3 due 
to a plane determined by the line $\delta_j$ in view 2 and the center of projection of 
camera 2. Likewise, the third type is a homography from view 1 to 2 via a plane 
determined by the line $\delta_k$ in view 3 and the center of projection of camera 3. 
These homography matrices were introduced in [?] and are described in more 
detail in the appendix here. 

Fig. 1. The matrix $[\delta]_\times F$ is a homography matrix due to a plane $\pi$ coincident 
with the center of projection $O'$ and the line $\delta$ in view 2. The line $s$ is the epipolar 
line and the point $P_\pi$ is at the intersection of the optic ray from the first view 
and the plane $\pi$; this point is then projected onto view 2. Therefore 
the point+line+line configuration of $p, \delta, s$ satisfies the bifocal tensor relation 
$p^i s_j \delta_k \mathcal{F}_i^{jk} = 0$, where $\delta_k \mathcal{F}_i^{jk}$ is the matrix $[\delta]_\times F$. 

We wish to consider the same types of contractions on the bifocal tensor $\mathcal{F}_i^{jk}$; 
by equivalence of form, we should obtain collineations and correlations as 
well. Consider the contraction 

$$\delta_k \mathcal{F}_i^{jk}$$

for some arbitrary vector $\delta$. By substitution in the definition of $\mathcal{F}_i^{jk}$ we obtain 

$$\delta_k \mathcal{F}_i^{jk} = \underbrace{\delta_k \epsilon^{ljk}}_{[\delta]_\times} F_{li},$$

which in matrix form becomes $[\delta]_\times F$. Our question therefore is about the
geometric interpretation of this matrix (for an arbitrary $\delta$). Given the form-equivalence 
of the two tensors the answer is immediate: $[\delta]_\times F$ is a homography matrix from 
view 1 to view 2 via a plane coincident with the center of projection $O'$ of
camera 2 and the line $\delta$ in view 2. The family of such matrices over all choices of $\delta$ 
corresponds to the family of homography matrices whose planes are coincident 
with $O'$. The family is spanned by three matrices (since $\delta$ is spanned by three 
vectors), and for example, the three slices using $\delta = (1,0,0)$, $(0,1,0)$ or $(0,0,1)$ 
will provide the basis for this subgroup of homography matrices. 

More formally, consider the plane $\pi$ defined by the point $O'$ and the line $\delta$ in 
view 2. Consider a point $p$ in view 1 and the ray from the center of projection $O$ 
of the first camera and the point $p$. The ray intersects $\pi$ at $P_\pi$, which projects to 
a point in view 2 which is coincident with the line $\delta$ (by construction). Let $s_j$ 
be the epipolar line of $p$ in view 2, thus $p^i s_j \delta_k \mathcal{F}_i^{jk} = 0$ because they provide a 
point+line+line configuration, and this holds for all points $p$ (see Fig. 1). Thus, 
the matrix $\delta_k \mathcal{F}_i^{jk}$ maps view 1 onto points along the corresponding epipolar 
lines and is therefore a homography matrix, and since the projected points are 
collinear the rank of the matrix is 2. We have the following result: 

Theorem 1. The matrix $\delta_k \mathcal{F}_i^{jk}$, i.e. $[\delta]_\times F$ in matrix form, is a homography matrix of rank 2 from view 1 
to view 2 due to the plane coincident with the center of projection of camera 2 
and the line $\delta$ in view 2. 

Note that the theorem generalizes the observation due to [?] that $[v']_\times F$ is 
a homography matrix. We see that this is true for any choice of skew-symmetric 
matrix $[\delta]_\times$. 
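Theorem 1 lends itself to a quick numerical check. The sketch below is illustrative only: the fundamental matrix is the hypothetical $F = [v']_\times A$ built from made-up integer values, the helper names are ours, and `T[i][j][k]` stores $\mathcal{F}_i^{jk}$.

```python
def skew(a):
    return [[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]]

def matmul(X, Y):
    return [[sum(X[r][m] * Y[m][c] for m in range(3)) for c in range(3)]
            for r in range(3)]

def matvec(M, x):
    return [sum(M[r][c] * x[c] for c in range(3)) for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def bifocal(F):
    """T[i][j][k] = eps^{ljk} F_{li}."""
    T = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    for l, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        for i in range(3):
            T[i][j][k] += F[l][i]    # eps^{ljk} = +1
            T[i][k][j] -= F[l][i]    # eps^{lkj} = -1
    return T

def contract_k(T, delta):
    """H[j][i] = delta_k T_i^{jk}."""
    return [[sum(delta[k] * T[i][j][k] for k in range(3)) for i in range(3)]
            for j in range(3)]

F = [[-1, -1, 2], [2, -1, 0], [-3, 3, -2]]   # hypothetical F = [v']_x A
delta = [1, -1, 2]                           # hypothetical line in view 2
H = contract_k(bifocal(F), delta)

p = [3, -1, 1]                               # a point in view 1
q = matvec(H, p)                             # its image under H
minor = H[0][0] * H[1][1] - H[0][1] * H[1][0]
```

The image `q` lies both on the line $\delta$ and on the epipolar line $Fp$, so it is their intersection; a vanishing determinant together with a non-zero $2 \times 2$ minor confirms rank 2 for this example.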

The same result applies for the contraction $\delta_j \mathcal{F}_i^{jk}$ (with a change of sign). 
Note that with the trifocal tensor there is a difference between the contractions 
$\delta_j \mathcal{T}_i^{jk}$ and $\delta_k \mathcal{T}_i^{jk}$, in which the former produces a homography matrix from view 
1 to 3 and the latter produces a homography matrix from view 1 to 2. In the case 
of the bifocal tensor views 2 and 3 coincide, thus the two types of contractions 
are equivalent. 

The remaining contraction type is $\delta^i \mathcal{F}_i^{jk}$. In the case of the trifocal tensor 
the contraction $\delta^i \mathcal{T}_i^{jk}$ produces a correlation matrix which maps the space of 
lines from view 2 to a set of collinear points (on the epipolar line of the point $\delta$ 
in view 1) in view 3. The transpose of that matrix is the same type of mapping, 
but from view 3 to view 2 (see Appendix). We should obtain something similar 
for the bifocal tensor and since view 2 and 3 coincide the matrix $\delta^i \mathcal{F}_i^{jk}$ should 
map the space of lines in view 2 onto collinear points in view 2 that define the 
epipolar line, $F\delta$, of the point $\delta$. Indeed, by substitution we obtain: 

$$\delta^i \mathcal{F}_i^{jk} = \epsilon^{ljk} F_{li} \delta^i = [F\delta]_\times \qquad (2)$$

Thus, $[F\delta]_\times s$ for all lines $s$ in view 2 is the point of intersection of the epipolar 
line $F\delta$ and the line $s$. In other words, the matrix $[F\delta]_\times$ is the correlation matrix 
we described above. Note that the reason we have obtained a trivial mapping 
is due to the fact that this type of contraction is associated with reconstruction 
for lines. The three matrices $\delta^i \mathcal{T}_i^{jk}$ for $\delta = (1,0,0)$, $(0,1,0)$ and $(0,0,1)$ are 
known to arise from considerations of matching lines across three views (cf. 
[?]). However, the relative camera positions cannot be recovered from 
matching lines across two views only (only from matching points), which is why 
the corresponding correlation matrices of the bifocal tensor become trivial. 
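The “trivial” correlation can also be seen numerically. In the sketch below (hypothetical values, our own helper names), the contraction over the lower index is stored as `C[j][k]`; under this storage convention it equals $[F\delta]_\times$ up to sign, which is immaterial in homogeneous coordinates.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matvec(M, x):
    return [sum(M[r][c] * x[c] for c in range(3)) for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bifocal(F):
    """T[i][j][k] = eps^{ljk} F_{li}."""
    T = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    for l, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        for i in range(3):
            T[i][j][k] += F[l][i]
            T[i][k][j] -= F[l][i]
    return T

def contract_i(T, delta):
    """C[j][k] = delta^i T_i^{jk}."""
    return [[sum(delta[i] * T[i][j][k] for i in range(3)) for k in range(3)]
            for j in range(3)]

F = [[-1, -1, 2], [2, -1, 0], [-3, 3, -2]]   # hypothetical fundamental matrix
delta = [3, -1, 1]                           # here delta is a point in view 1
C = contract_i(bifocal(F), delta)
epiline = matvec(F, delta)                   # the epipolar line F delta

lines = [[1, 0, 0], [0, 1, 1], [2, -1, 3]]   # arbitrary lines in view 2
points = [matvec(C, s) for s in lines]       # their images under C
```

Each image point is the intersection of the input line with the fixed epipolar line $F\delta$, so all images are collinear, and the mapping carries no information beyond $F$ itself.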

To summarize, the embedding of the fundamental matrix in trivalent tensor 
format (the bifocal tensor) provides a unified terminology of a “point+line+line” 
that applies for both the bifocal and trifocal relationships across multiple views. 
In particular, as is the case with the trifocal tensor, contractions of the bifocal 
tensor into reduced forms (matrices) have a geometric significance. The
contraction properties of the bifocal tensor are listed in Table 1. We see a clear 
analogy to the type of resulting matrices (homography and correlations) one
obtains from the same contractions applied to the trifocal tensor. Furthermore, the 
homography contraction provides the basis for all rank-2 homography matrices 
whose planes are coincident with the center of projection of camera 2. All linear 
combinations of the rank-2 homography matrices are of the form $[\delta]_\times F$ for some 
vector $\delta$. 



Table 1. The three types of contractions of the bifocal tensor (embedding of 
the fundamental matrix $F$ as a trivalent tensor $\mathcal{F}_i^{jk}$), their matrix form, and the 
property they produce. Note that the first two contractions produce a homography 
matrix of a plane whose orientation is determined by the vector of contraction 
$\delta$. 

Contraction                      Matrix Form            Result
$\delta_k \mathcal{F}_i^{jk}$    $[\delta]_\times F$    Homography matrix.
$\delta_j \mathcal{F}_i^{jk}$    $[\delta]_\times F$    Same as above.
$\delta^i \mathcal{F}_i^{jk}$    $[F\delta]_\times$     Trivial correlation mapping.



4 The Primitive Homography Matrices 

We have seen that the family of matrices $[\delta]_\times F$ parameterized by the choice of 
the vector $\delta$ spans the family of homography matrices from view 1 to view 2 
due to the planes coincident with the center of projection $O'$ of camera 2. The 
vector $\delta$ determines the orientation of the plane and is the line of intersection of 
the plane and view 2. Since $\delta$ is spanned by three vectors, say $(1,0,0)$, $(0,1,0)$ 
and $(0,0,1)$, the bifocal tensor contractions provide three distinct homography 
matrices that span the subgroup of homography matrices (those whose planes 
are coincident with $O'$). Since the entire group of all homography matrices lies 
in a 4 dimensional subspace [?], i.e., is spanned by 4 homography matrices whose 
planes do not all coincide with a single point, we must produce an additional 
homography matrix in order to complete the basis of the subgroup defined by 
$[\delta]_\times F$ to a full basis for the entire group. The elements (matrices) of the full basis 
will be called “primitive homographies”. The additional homography matrix we 
seek must therefore be associated with a plane coincident with the center of 
projection $O$ of camera 1 (and is therefore of rank 1). We have the following 
Lemma which is adapted from [?]: 

Lemma 1. Given the fundamental matrix $F$ and the epipole $v'$ defined by $F^\top v' = 0$, 
then the family of matrices $v'\delta^\top$ are homography matrices from view 1 to view 
2 due to planes coincident with the center of projection $O$ of camera 1 and the 
vector $\delta$ is the intersection line of the plane and view 1. 

Proof: Let $A_1, A_2$ be any two homography matrices. Thus, $A_1 p$, $A_2 p$ and $v'$ 
are collinear for all points $p$ in view 1. Let $q \neq v$, where $v$ is the epipole in 
view 1 ($Fv = 0$), be some point in view 1 and let $\lambda$ be a scalar defined such 
that $A_1 q - \lambda A_2 q \cong v'$. Let $H = A_1 - \lambda A_2$ be a homography matrix (because 
all homography matrices are closed under linear combinations). Clearly, since 
$Hq \cong v'$, then $Hp \cong v'$ for all $p$ (because $Hv \cong v'$ as well). Thus $H = v'\delta^\top$ for 
some vector $\delta$. [] 

Therefore, as long as $\delta^\top v \neq 0$, where $v$ is the epipole in view 1 (i.e., $Fv = 0$), 
then the plane of the homography matrix $v'\delta^\top$ does not coincide with $O'$ (only with $O$) 
and thus can be used to complete the full basis for the group of homography 
matrices. Without loss of generality assume that $(1,0,0)$ is not coincident with $v$, 
thus we have a basis of 4 homography matrices $H_1, \ldots, H_4$, denoted as “primitive 
homographies”, defined below: 

$$H_i = [e_i]_\times F, \quad i = 1, 2, 3 \qquad (3)$$

$$H_4 = v' e_1^\top \qquad (4)$$

where $e_i$ are the identity vectors: $e_1 = (1,0,0)$, $e_2 = (0,1,0)$ and $e_3 = (0,0,1)$. 
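The primitive homographies can be assembled and sanity-checked as follows. This is a minimal sketch with made-up integer values; the coefficient choice used to rebuild $A$ (a cross product of two columns of $A$) is our own construction for the example and not a procedure taken from the paper.

```python
def skew(a):
    return [[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]]

def matmul(X, Y):
    return [[sum(X[r][m] * Y[m][c] for m in range(3)) for c in range(3)]
            for r in range(3)]

def matvec(M, x):
    return [sum(M[r][c] * x[c] for c in range(3)) for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def column(M, c):
    return [M[r][c] for r in range(3)]

E = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # the identity vectors e_1, e_2, e_3

def primitive_homographies(F, v2):
    H = [matmul(skew(e), F) for e in E]                                 # H_1..H_3
    H.append([[v2[r] * E[0][c] for c in range(3)] for r in range(3)])  # H_4 = v' e_1^T
    return H

def combine(coeffs, mats):
    return [[sum(a * M[r][c] for a, M in zip(coeffs, mats)) for c in range(3)]
            for r in range(3)]

A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]   # hypothetical homography
v2 = [1, 2, 1]                          # hypothetical epipole v'
F = matmul(skew(v2), A)
H = primitive_homographies(F, v2)

p = [3, -1, 1]
# Each H_j sends p to a point on the epipolar line F p.
on_epiline = [dot(matvec(Hj, p), matvec(F, p)) for Hj in H]

# Coefficients that rebuild A up to scale (our own choice for this example).
alpha = cross(column(A, 1), column(A, 2))   # alpha^T A = (det A, 0, 0)
alpha4 = -dot(alpha, column(A, 0))          # equals -det A
A_rebuilt = combine(alpha + [alpha4], H)
scale = -dot(alpha, v2)
```

Here `A_rebuilt` equals `scale * A`, which illustrates that a homography of a general plane is reached only once $H_4$ is added to the three rank-2 matrices.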

5 Applications Using Primitive Homography Matrices 

The primitive homography matrices are a useful tool for representing geometric 
data. We will consider two examples here, the first on obtaining a “quasi-metric” 
representation of 3D space from a pair of uncalibrated cameras, and the second 
on “triangulation” from 3 views. 

5.1 Quasi-Metric Reference Plane 

Let $p_i, p'_i$, $i = 1, \ldots, N$, be matching points in view 1 and 2 respectively. Given 
the fundamental matrix $F$ and the epipole $v'$ in view 2, then the 3D
projective representation of the object space points $P_i$ can be described relative to a 
reference plane $\pi$: 

$$p'_i \cong A_\pi p_i + \rho_i v' = [A_\pi, v'] P_i$$

where $A_\pi$ is the homography matrix mapping view 1 onto view 2 due to the 
plane $\pi$. The scalar $\rho_i$ represents the relative deviation of the point $P_i$ from 
the plane $\pi$ and is called the “relative affine structure” [?]. The choice of the 
plane $\pi$ determines the projective representation of object space. For purposes 
of visualization, it is useful to choose $\pi$ such that it is situated “in-between” the 
space points, making it possible to treat $\rho_i$ as a simple depth variable. In other 
words, let $A_\pi = \sum_{j=1}^{4} \alpha_j H_j$; we seek to solve for the scalars $\alpha_j$, $j = 1, \ldots, 4$, that 
best satisfy 

$$\Big( \sum_{j=1}^{4} \alpha_j H_j \Big) p_i \cong p'_i, \qquad i = 1, \ldots, N,$$

which provides an over-determined linear set of equations. We will refer to $\pi$ 
as the “quasi-metric” plane. The choice of the quasi-metric plane provides a 
better chance that the projective viewing of the object (treating the
coordinates $x_i, y_i, \rho_i$ as Euclidean coordinates by the viewing program) will have less 
projective distortion than other choices. 
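One way to see why these conditions are linear in the $\alpha_j$ is to remove the unknown scale with a cross product. This is a standard device and an assumption of this sketch; the text does not state how the scale is handled.

```latex
% Assumption: the scale in (sum_j alpha_j H_j) p_i ~ p'_i is eliminated
% by taking the cross product with p'_i.
\[
p'_i \times \Big( \sum_{j=1}^{4} \alpha_j H_j\, p_i \Big) = 0
\;\;\Longleftrightarrow\;\;
[p'_i]_\times
\begin{pmatrix} H_1 p_i & H_2 p_i & H_3 p_i & H_4 p_i \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \end{pmatrix} = 0,
\qquad i = 1, \ldots, N .
\]
```

Stacking these blocks gives a homogeneous $3N \times 4$ system in $(\alpha_1, \ldots, \alpha_4)$, defined up to scale, which can be solved in the least-squares sense (for instance by taking the unit vector that minimizes the residual norm). When $p'_i$ lies exactly on its epipolar line, all the vectors $H_j p_i$ lie on that same line, so each match contributes a single independent equation and at least three matches are needed.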

5.2 Triangulation from 3 Views 

Hartley and Sturm [?] considered the problem, which they called “triangulation”, of 
modifying the locations of input matching points $p, p'$ that are given with noise 
to new locations $\hat{p}, \hat{p}'$ that satisfy $\hat{p}'^\top F \hat{p} = 0$ such that $(p - \hat{p})^2 + (p' - \hat{p}')^2$ 
is minimized. The triangulation problem in 3 views can be stated in a similar 
manner: given $p$ in view 1, the matching process produces an error in the matches 
in view 2 and 3. The input matches are $p'$ and $p''$ and we wish to find new matches 
$\hat{p}', \hat{p}''$ such that the triplet $p, \hat{p}', \hat{p}''$ satisfies the trilinear equations while 
$(p' - \hat{p}')^2 + (p'' - \hat{p}'')^2$ is minimized. Note that we do not add an error term 
to $p$ and rather take $p$ as a reference. The reason for that is twofold. First, 
the trifocal tensor is asymmetric with respect to view ordering, as it is 
defined with respect to a reference view (unlike the fundamental matrix which 
remains fixed under view ordering). Secondly, in most matching approaches that 
use a correlation principle, like the popular Lucas-Kanade [?] method with 
the coarse-to-fine implementation by Sarnoff Corp., there is also an intrinsic 
asymmetry that assumes one of the views as a reference. Taken together, we can 
without loss of generality assume that the effect of error in the matching process 
is represented in the displacement of $p'$ and $p''$ from their true locations $\hat{p}', \hat{p}''$. 

The triangulation process using the trifocal tensor can proceed as follows. 
We first note that the following relationship exists: 

$$\hat{p}' \cong Ap + \rho v' \qquad (5)$$

$$\hat{p}'' \cong Bp + \rho v'' \qquad (6)$$

where $A, B$ are two homography matrices from view 1 to 2 and from view 1 to 
3 via some reference plane $\pi$ (any plane). Given the trifocal tensor $\mathcal{T}_i^{jk}$ one can 
recover the epipoles $v', v''$ (and fundamental matrices) [?] and proceed to 
recover a pair of homography matrices $A, B$ as described below. 

One can solve for $A$ by either choosing some linear combination of the
primitive homographies or solving for the quasi-metric plane as described in the 
previous section. Thus we can assume that $A$ is known. The corresponding
homography $B$ cannot be chosen arbitrarily because it must be associated with the 
same plane $\pi$ that was associated with the homography $A$. 
Let $H_l$, $l = 1, \ldots, 4$, be the primitive homographies from view 1 to 3. Let the 
sought after matrix $B$ be represented by $B = \sum_l \beta_l H_l$. We seek a solution of the 
scalars $\beta_l$. We have the following relationship: 

$$\mathcal{T}_i^{jk} = v'^j \Big( \sum_l \beta_l H_l \Big)_i^k - \lambda\, v''^k a_i^j \qquad (7)$$

where the left-hand side is known (the trifocal tensor) and the right-hand side 
contains 5 unknowns which together form an over-determined linear system. The 
scalar $\lambda$ fixes the scale because $v', v'', A$ are all determined up to scale. Taken 
together, from the trifocal tensor and with the use of the primitive homographies 
we can extract a set of compatible homographies (associated with the same 
plane) and the epipoles $v', v''$. 

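We are now left with minimizing the following expression: 

$$\min_{\rho} \left\{ \Big( x' - \frac{a_1^\top p + \rho v'_1}{a_3^\top p + \rho v'_3} \Big)^2 + \Big( y' - \frac{a_2^\top p + \rho v'_2}{a_3^\top p + \rho v'_3} \Big)^2 + \Big( x'' - \frac{b_1^\top p + \rho v''_1}{b_3^\top p + \rho v''_3} \Big)^2 + \Big( y'' - \frac{b_2^\top p + \rho v''_2}{b_3^\top p + \rho v''_3} \Big)^2 \right\}$$

where $a_k^\top$ and $b_k^\top$ denote the $k$-th rows of $A$ and $B$, and $(x', y')$, $(x'', y'')$ are the 
coordinates of the input matches $p'$, $p''$. The expression 
is minimized with respect to $\rho$. This yields a 4th order polynomial in 
$\rho$ which thus has a closed-form solution. The geometric interpretation of this 
minimization process is that the solution $\rho$ determines the points $\hat{p}', \hat{p}''$ on their 
corresponding epipolar lines such that the distance $(p' - \hat{p}')^2 + (p'' - \hat{p}'')^2$ is 
minimized. Note that unlike the case of two views, one cannot place $\hat{p}'$ and 
$\hat{p}''$ anywhere on their epipolar lines because they are coupled together by a
1-parameter degree of freedom. In particular, the projections of $p'$ and $p''$ on their 
epipolar lines may not be an admissible solution. 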
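A short calculation, offered here as our own reading rather than a derivation taken from the text, indicates where the 4th order polynomial comes from. The shorthand $u, w, c_1, d_1$ is introduced only for this sketch, with $u(\rho) = a_3^\top p + \rho v'_3$ and $w(\rho) = b_3^\top p + \rho v''_3$ the two denominators, and $a_k^\top$ the $k$-th row of $A$.

```latex
% First term of the cost, written over the common denominator u(rho):
\[
x' - \frac{a_1^\top p + \rho v'_1}{u(\rho)} = \frac{c_1 + \rho\, d_1}{u(\rho)},
\qquad
c_1 = x'\, a_3^\top p - a_1^\top p, \quad d_1 = x' v'_3 - v'_1 .
\]
% Its squared value has a derivative with a numerator that is linear in rho:
\[
\frac{d}{d\rho}\, \frac{(c_1 + \rho\, d_1)^2}{u(\rho)^2}
= \frac{2\,(c_1 + \rho\, d_1)\,\big(d_1\, a_3^\top p - c_1 v'_3\big)}{u(\rho)^3} .
\]
% The other three terms behave in the same way, so stationarity reads
\[
\frac{L'(\rho)}{u(\rho)^3} + \frac{L''(\rho)}{w(\rho)^3} = 0
\;\;\Longleftrightarrow\;\;
L'(\rho)\, w(\rho)^3 + L''(\rho)\, u(\rho)^3 = 0 ,
\]
% with L' and L'' linear in rho (sums of the view-2 and view-3 numerators).
```

Since $u$ and $w$ are linear in $\rho$, the last equation is a polynomial of degree at most 4, whose real roots are the candidate minimizers.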



6 Other Applications of the Bifocal Tensor 
Representation 

In the previous sections we presented applications of the primitive
homographies, which in turn are due to the discovery of $[\delta]_\times F$ representing the family 
of rank-2 homographies, which in turn is due to the bifocal tensor
representation. However, one could possibly re-derive the result $[\delta]_\times F$ from purely matrix 
considerations without relying on the bifocal tensor. Nevertheless, there are
applications that critically rely on the tensor embedding of the fundamental matrix 
in the form of the bifocal tensor, and in this section we briefly discuss two of 
them. 



6.1 View-Synthesis 

The notion of image-based rendering is gaining momentum both in the computer 
graphics and computer vision communities. Using the trifocal tensor for
image-based rendering was proposed by [?]. In a nutshell, the method links together 
two real views of a 3D scene with a third virtual view of the scene. The tensor is 
then used to reproject a point appearing in the first two views directly onto the 
virtual view, without ever recovering 3D structure. Moving the virtual camera 
in space is done by modifying the tensor to reflect the change in the relative 
position of the virtual view. To bootstrap the seed tensor one would need three 
real views of the object, but only two of them will be later used for the generation 
of the virtual view. However, using the tensor-embedded fundamental matrix, 
one can use only two real images to generate the bifocal “seed tensor”. 

Fig. 2. View synthesis is divided into two parts. In the pre-processing stage, 
done only once, we compute the dense correspondence and the bifocal tensor. 
The rendering stage, done for every novel image, transforms the “seed” bifocal 
tensor to a general three-view trifocal tensor, using user-specified parameters 
$R, t$, and renders the novel view using the transformed tensor, the model images 
and the dense correspondence. 

One starts with the bifocal tensor, which is then transformed, using the
user-specified motion of the virtual camera, to the appropriate trifocal tensor
(of the original two model views and the virtual view to be synthesized). From
there on, the trifocal tensors transform as the virtual camera changes position
(see Fig. 2). Thus, for this application to work it is necessary to have a uniform
terminology for handling 2 and 3 views.



6.2 Ego-motion Recovery

When considering the problem of recovering the camera ego-motion (projection
matrices) from a stream of views, one faces the problem of maintaining
consistency among pairwise fundamental matrices. The consistency requirement
arises from the simple fact that, from an algebraic standpoint, a camera
trajectory must be concatenated from pairs or triplets of images. Therefore, a
sequence of independently computed fundamental matrices or trifocal tensors may
be optimally consistent with the image data, but not necessarily consistent with
a unique camera trajectory (see Fig. 3).

The consistency problem can be approached by introducing the following
equation, which relates the trifocal tensor between views 1,2,3 to the bifocal
tensor between views 1,2 and the elements of the fundamental matrix between
views 2,3:

    $$\mathcal{T}_i^{jk} = c_l^k \mathcal{F}_i^{jl} - v'''^k a_i^j \qquad (8)$$

where $\mathcal{T}_i^{jk}$ is the tensor of views 1,2,3, the matrix A, whose
elements are $a_i^j$, is a homography from views 1 to 2 via some arbitrary plane
$\pi$, $\mathcal{F}_i^{jl}$ is the bifocal tensor of views 1,2, and
$C = [c_l^k, v''']$ is the camera motion from view 2 to 3, where $c_l^k$ is a
homography matrix from view 2 to 3 via the (same) plane $\pi$.

Fig. 3. One can compute two tensors $T_{123}$, $T_{234}$ from the four images of
the 3D scene. However, each tensor can give rise to a different reconstruction of
the 3D structure due to noise or errors in measurements, and therefore the camera
trajectory between images 2 and 3, as captured by the fundamental matrix
$F_{23}$, is inconsistent between the two tensors. Figure taken from [?].
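The relation between the seed (bifocal) tensor and the trifocal tensor can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: the bifocal tensor is assumed to be the trifocal tensor of the view triple (1,2,2), i.e. $\mathcal{F}_i^{jk} = v'^j a_i^k - v'^k a_i^j$; the trifocal tensor is assumed to have the Appendix form $\mathcal{T}_i^{jk} = v'^j b_i^k - v''^k a_i^j$; and the third camera is assumed to arise by applying the motion $[C, v''']$ to the second one. The function names and the integer test cameras are hypothetical.

```python
# Storage convention (an assumption of this sketch): T[i][j][k] = T_i^{jk},
# and a 3x3 matrix with elements m_i^j is stored as M[j][i] (row = upper index).

def trifocal(A, v1, B, v2):
    """T_i^{jk} = v'^j b_i^k - v''^k a_i^j for cameras [I;0], [A,v'], [B,v'']."""
    return [[[v1[j] * B[k][i] - v2[k] * A[j][i] for k in range(3)]
             for j in range(3)] for i in range(3)]

def bifocal(A, v1):
    """Assumed seed tensor: the trifocal tensor of views (1, 2, 2)."""
    return trifocal(A, v1, A, v1)

def transform_seed(F, A, C, v3):
    """T_i^{jk} = c_l^k F_i^{jl} - v'''^k a_i^j (summation over l)."""
    return [[[sum(C[k][l] * F[i][j][l] for l in range(3)) - v3[k] * A[j][i]
              for k in range(3)] for j in range(3)] for i in range(3)]

def compose(C, v3, A, v1):
    """Third camera [B, v''] obtained by applying [C, v'''] to [A, v']."""
    B = [[sum(C[k][l] * A[l][i] for l in range(3)) for i in range(3)]
         for k in range(3)]
    v2 = [sum(C[k][l] * v1[l] for l in range(3)) + v3[k] for k in range(3)]
    return B, v2

# Hypothetical integer cameras, so that the comparison below is exact.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v1 = [1, -2, 3]
C = [[1, 2, 0], [0, 1, 1], [3, 0, 1]]
v3 = [4, 0, -1]

B, v2 = compose(C, v3, A, v1)
T_direct = trifocal(A, v1, B, v2)                        # built from the cameras
T_from_seed = transform_seed(bifocal(A, v1), A, C, v3)   # built from the seed
print(T_direct == T_from_seed)
```

Under these assumptions the two constructions agree term by term, which is what allows the virtual camera to be moved by updating $C$ and $v'''$ alone.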

As a result, given the fundamental matrix between views 1,2 and (at least) 6
matching points between views 1,2,3, one can solve for the fundamental matrix
between views 2 and 3 (i.e., $[v''']_\times C$) which is consistent with the
trifocal relationship among views 1,2,3. Also, as a byproduct, the projection
matrix $[C, v''']$ is consistent with the same projective representation, due to
the fact that the homographies A, C are of the same reference plane. The details
and demonstration of this idea can be found in [?].



7 Summary 

We have introduced a new representation of the bilinear matching constraint
between a pair of views in terms of a 3 x 3 x 3 tensor which we termed the
“bifocal” tensor. The motivation for the new representation is to establish a
unified terminology between the elements of 2-view and 3-view constraints. The
unified terminology is achieved by representing the 2-view constraint in a way
analogous to (and identical in form with) the trifocal tensor relationship. As a
result, we were able to transfer the properties known today about the trifocal
tensor (especially the contraction into homography matrices) to the realm of the
2-view case.

The byproduct of the new representation is twofold. First, we have derived
the family of rank-2 homography matrices represented by $[\delta]_\times F$ and
introduced the “primitive homographies” and their applications. Second, we
mentioned two other applications for which the unified terminology is necessary.

Taken together, it is useful to have a common language for analyzing the
geometric constraints arising from multiple-view geometry, both at the
theoretical level, for purposes of obtaining a clean representation, and for
applications where the common language is sometimes necessary (as was shown in
Section 6).



A Appendix

A.1 Trilinearities and the Trifocal Tensor



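Three views, $p = [I; 0]x$, $p' = Ax$ and $p'' = Bx$, are known to produce four
trilinear forms whose coefficients are arranged in a tensor representing a
bilinear function of the camera matrices A, B:

    $$\mathcal{T}_i^{jk} = v'^j b_i^k - v''^k a_i^j \qquad (1)$$

where $A = [a_i^j, v'^j]$ ($a_i^j$ is the 3 x 3 left minor and $v'$ is the fourth
column of A) and $B = [b_i^k, v''^k]$. The tensor acts on a triplet of matching
points in the following way:

    $$p^i s_j^\mu r_k^\rho \mathcal{T}_i^{jk} = 0 \qquad (2)$$

where $s_j^\mu$ are any two lines ($s_j^1$ and $s_j^2$) intersecting at $p'$, and
$r_k^\rho$ are any two lines intersecting at $p''$. Since the free indices are
$\mu, \rho$, each in the range 1,2, we have 4 trilinear equations (unique up to
linear combinations). If we choose the standard form where $s^\mu$ (and $r^\rho$)
represent vertical and horizontal scan lines, i.e.,

    $$s_j^\mu = \begin{bmatrix} -1 & 0 & x' \\ 0 & -1 & y' \end{bmatrix}$$

then the four trilinear forms, referred to as trilinearities [?], have the
following explicit form:

    $$x'' T_i^{13} p^i - x'' x' T_i^{33} p^i + x' T_i^{31} p^i - T_i^{11} p^i = 0,$$
    $$y'' T_i^{13} p^i - y'' x' T_i^{33} p^i + x' T_i^{32} p^i - T_i^{12} p^i = 0,$$
    $$x'' T_i^{23} p^i - x'' y' T_i^{33} p^i + y' T_i^{31} p^i - T_i^{21} p^i = 0,$$
    $$y'' T_i^{23} p^i - y'' y' T_i^{33} p^i + y' T_i^{32} p^i - T_i^{22} p^i = 0.$$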
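As a hedged illustration of how the tensor acts on a matching triplet, the sketch below builds $\mathcal{T}_i^{jk} = v'^j b_i^k - v''^k a_i^j$ from two hypothetical integer cameras, projects one 3D point into the three views, and evaluates the four forms $p^i s_j^\mu r_k^\rho \mathcal{T}_i^{jk}$ with the scan-line choice $s^1 = (-1, 0, x')$, $s^2 = (0, -1, y')$ (and likewise for $r$). The function names, camera entries and test point are assumptions made for this example only; exact rational arithmetic is used so that the forms vanish exactly rather than approximately.

```python
from fractions import Fraction

def trifocal(A, v1, B, v2):
    """T[i][j][k] = v'^j b_i^k - v''^k a_i^j, matrices stored as M[row][col]."""
    return [[[v1[j] * B[k][i] - v2[k] * A[j][i] for k in range(3)]
             for j in range(3)] for i in range(3)]

def project(M, v, p, lam):
    """Image of the 3D point (p, lam) under camera [M, v], scaled to (x, y, 1)."""
    q = [sum(M[r][c] * p[c] for c in range(3)) + lam * v[r] for r in range(3)]
    return [Fraction(q[0], q[2]), Fraction(q[1], q[2]), Fraction(1)]

def trilinearities(T, p, p1, p2):
    """The four values p^i s_j^mu r_k^rho T_i^{jk} for mu, rho in {1, 2}."""
    s = [[-1, 0, p1[0]], [0, -1, p1[1]]]   # scan lines through p'
    r = [[-1, 0, p2[0]], [0, -1, p2[1]]]   # scan lines through p''
    return [sum(p[i] * s[mu][j] * r[rho][k] * T[i][j][k]
                for i in range(3) for j in range(3) for k in range(3))
            for mu in range(2) for rho in range(2)]

# Hypothetical cameras [A, v'] and [B, v''] and a 3D point (p, lam).
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v1 = [1, -2, 3]
B = [[1, 0, 2], [0, 1, 1], [1, 1, 1]]
v2 = [2, 1, 1]
p, lam = [1, 2, 1], 3

T = trifocal(A, v1, B, v2)
p1 = project(A, v1, p, lam)    # p'
p2 = project(B, v2, p, lam)    # p''
values = trilinearities(T, p, p1, p2)
print(values)
```

Perturbing one of the image points (for instance shifting $x''$ by one pixel) makes at least one of the four forms nonzero, which is why they can serve as matching constraints.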



These constraints were first derived in [?]; the tensorial derivation leading to
eqns. (1) and (2) was first derived in [?]. The tensor is often referred to as
“trilinear” or “trifocal”, and we adopt here the term trifocal tensor. The
trifocal tensor has been well known in disguise in the context of Euclidean line
correspondences and was not identified at the time as a tensor but as a
collection of three matrices (a particular contraction of the tensor, correlation
contractions, as explained next) [?]. The link between the trilinearities and the
matrices of line geometry was identified later by Hartley [?]. Additional work in
this area can be found in [?].



The tensor has certain contraction properties and can be sliced in three
principled ways into matrices with distinct geometric properties. These
properties are what make the tensor distinct from simply being a collection of
three matrices and will be briefly discussed next; further details can be found
in [?].

A.2 Contraction Properties and Tensor Slices

Consider the matrix arising from the contraction,

    $$\delta_k \mathcal{T}_i^{jk} \qquad (3)$$

which is a 3 x 3 matrix, denoted by E, obtained by the linear combination
$E = \delta_1 T_i^{j1} + \delta_2 T_i^{j2} + \delta_3 T_i^{j3}$ (which is what is
meant by a contraction), where $\delta_k$ is an arbitrary covariant vector. The
matrix E has a general meaning introduced in [?]:

Proposition 1 (Homography Contractions). The contraction $\delta_k T_i^{jk}$ for
some arbitrary $\delta_k$ is a homography matrix from image one onto image two
determined by the plane containing the third camera center $C''$ and the line
$\delta_k$ in the third image plane. Generally, the rank of E is 3. Likewise, the
contraction $\delta_j T_i^{jk}$ is a homography matrix from image one onto image
three.

For proof see [?]. Clearly, since $\delta$ is spanned by three vectors, we can
generate at most three distinct homography matrices by contractions of the
tensor. We define the Standard Homography Slicing as the homography contractions
associated with selecting $\delta$ to be (1,0,0), (0,1,0) or (0,0,1). Thus the
three standard homography slices between image one and two are $T_i^{j1}$,
$T_i^{j2}$ and $T_i^{j3}$, which we denote by $E_1, E_2, E_3$ respectively, and
likewise the three standard homography slices between image one and three are
$T_i^{1k}$, $T_i^{2k}$ and $T_i^{3k}$, which we denote by $W_1, W_2, W_3$
respectively.
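The homography contraction can be illustrated with a small exact computation. This is a sketch under assumptions, not code from the paper: the tensor is taken as $\mathcal{T}_i^{jk} = v'^j b_i^k - v''^k a_i^j$, the cameras are hypothetical integer matrices, and the 3D point is chosen on the plane through the third camera center and the line $\delta$ in the third image, so that its image in view three satisfies $\delta \cdot p'' = 0$. For such a point the contraction $E = \delta_k \mathcal{T}_i^{jk}$ should map $p$ to a multiple of $p'$.

```python
from fractions import Fraction

def trifocal(A, v1, B, v2):
    """T[i][j][k] = v'^j b_i^k - v''^k a_i^j, matrices stored as M[row][col]."""
    return [[[v1[j] * B[k][i] - v2[k] * A[j][i] for k in range(3)]
             for j in range(3)] for i in range(3)]

def homography_contraction(T, delta):
    """E with elements delta_k T_i^{jk}, stored as E[j][i]."""
    return [[sum(delta[k] * T[i][j][k] for k in range(3)) for i in range(3)]
            for j in range(3)]

def matvec(M, p):
    return [sum(M[r][c] * p[c] for c in range(3)) for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

# Hypothetical cameras [A, v'], [B, v''], image point p and line delta.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v1 = [1, -2, 3]
B = [[1, 0, 2], [0, 1, 1], [1, 1, 1]]
v2 = [2, 1, 1]
p = [1, 2, 1]
delta = [1, 2, -1]

T = trifocal(A, v1, B, v2)
E = homography_contraction(T, delta)

# Depth parameter placing the 3D point (p, lam) on the plane defined by delta:
# delta . (B p + lam v'') = 0.
lam = Fraction(-dot(delta, matvec(B, p)), dot(delta, v2))
p_prime = [a + lam * v for a, v in zip(matvec(A, p), v1)]   # p' = A p + lam v'

print(cross(matvec(E, p), p_prime))   # zero vector when E p is parallel to p'

# The standard slice E_1 is the contraction with delta = (1, 0, 0).
E1 = homography_contraction(T, [1, 0, 0])
```

The standard slices $E_1, E_2, E_3$ are obtained from the same function with unit vectors for $\delta$, and any other contraction is a linear combination of them.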

Similarly, consider the contraction

    $$\delta^i \mathcal{T}_i^{jk} \qquad (4)$$

which is a 3 x 3 matrix, denoted by T, where $\delta^i$ is an arbitrary
contravariant vector. The matrix T has a general meaning as well, as detailed
below [?]:

Proposition 2. The contraction $\delta^i T_i^{jk}$ for some arbitrary $\delta^i$
is a rank 2 correlation matrix from image two onto image three, that maps the
dual image plane (the space of lines in image two) onto a set of collinear points
in image three that form the epipolar line corresponding to the point $\delta^i$
in image one. The null space of the correlation matrix is the epipolar line of
$\delta^i$ in image two. Similarly, the transpose of T is a correlation from
image three onto image two with the null space being the epipolar line in image
three corresponding to the point $\delta^i$ in image one.

For proof see [?]. We define the Standard Correlation Slicing as the correlation
contractions associated with selecting $\delta$ to be (1,0,0), (0,1,0) or
(0,0,1). Thus the three standard correlation slices are $T_1^{jk}$, $T_2^{jk}$
and $T_3^{jk}$, which we denote by $T_1, T_2, T_3$ respectively. The three
standard correlations date back to the work on structure from motion of lines
across three views [23,28], where these matrices were first introduced.
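The properties claimed for the correlation contraction can be checked on a small example. The sketch below is illustrative only and rests on assumptions: the tensor is taken as $\mathcal{T}_i^{jk} = v'^j b_i^k - v''^k a_i^j$, the cameras and the point $\delta$ are hypothetical integers, and the epipolar lines of $\delta$ are computed as $v' \times A\delta$ in image two and $v'' \times B\delta$ in image three.

```python
def trifocal(A, v1, B, v2):
    """T[i][j][k] = v'^j b_i^k - v''^k a_i^j, matrices stored as M[row][col]."""
    return [[[v1[j] * B[k][i] - v2[k] * A[j][i] for k in range(3)]
             for j in range(3)] for i in range(3)]

def correlation_contraction(T, delta):
    """Matrix with elements delta^i T_i^{jk}, stored as M[j][k]."""
    return [[sum(delta[i] * T[i][j][k] for i in range(3)) for k in range(3)]
            for j in range(3)]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def matvec(M, p):
    return [sum(M[r][c] * p[c] for c in range(3)) for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def map_line(M, s):
    """Point s_j M^{jk} in image three obtained from the line s in image two."""
    return [sum(s[j] * M[j][k] for j in range(3)) for k in range(3)]

# Hypothetical cameras [A, v'], [B, v''] and a point delta in image one.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v1 = [1, -2, 3]
B = [[1, 0, 2], [0, 1, 1], [1, 1, 1]]
v2 = [2, 1, 1]
delta = [1, 2, 1]

M = correlation_contraction(trifocal(A, v1, B, v2), delta)
epi2 = cross(v1, matvec(A, delta))   # epipolar line of delta in image two
epi3 = cross(v2, matvec(B, delta))   # epipolar line of delta in image three

print(det3(M))                 # a vanishing determinant means rank < 3
print(map_line(M, epi2))       # the epipolar line in image two is the null space
print([dot(map_line(M, s), epi3) for s in ([1, 0, 0], [0, 1, 0], [2, -1, 5])])
```

All mapped points are incident with the same line in image three, which is the collinearity stated in Proposition 2.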
Abstract. We describe work in progress on a numerical library for
estimating multi-image matching constraints, or more precisely the
multi-camera geometry underlying them. The library will cover several
variants of homographic, epipolar, and trifocal constraints, using various
different feature types. It is designed to be modular and open-ended, so
that (i) new feature types or error models, (ii) new constraint types or
parametrizations, and (iii) new numerical resolution methods, are
relatively easy to add. The ultimate goal is to provide practical code for
stable, reliable, statistically optimal estimation of matching geometry
under a choice of robust error models, taking full account of any nonlinear
constraints involved. More immediately, the library will be used to study
the relative performance of the various competing problem
parametrizations, error models and numerical methods. The paper focuses on
the overall design, parametrization and numerical optimization issues. The
methods described extend to many other geometric estimation problems
in vision, e.g. curve and surface fitting.

Keywords: Matching constraints, multi-camera geometry, geometric
fitting, statistical estimation, constrained optimization.



1 Introduction and Motivation 

This paper describes work in progress on a numerical library for the estimation 
of multi-image matching constraints. The library will cover several variants of 
homographic, epipolar, and trifocal constraints, using various common feature 
types. It is designed to be modular and open-ended, so that new feature types 
or error models, new constraint types or parametrizations, and new numerical 
resolution methods are relatively easy to add. The ultimate goal is to provide 
practical code for stable, reliable, statistically optimal estimation of matching 
geometry under a choice of robust error models, taking full account of any nonlin- 
ear constraints involved. More immediately, the library is being used to study the 
relative performance of the various competing problem parametrizations, error 
models and numerical methods. Key questions include: (i) how much difference
does an accurate statistical error model make; (ii) which constraint
parametrizations, initialization methods and numerical optimization schemes
offer the best reliability/speed/simplicity. The answers are most interesting
for near-degenerate problems, as these are the most difficult to handle reliably.
This paper focuses on architectural, parametrization and numerical optimization
issues. I have tried to give an overview of the relevant choices and technology,
rather than going into too much detail on any one subject. The methods described
extend to many other geometric estimation problems, such as curve and surface
fitting.

After motivating the library and giving notation in this section, we develop a 
general statistical framework for geometric fitting in §2 and discuss parametriza- 
tion issues in §3. §4 summarizes the library architecture and numerical tech- 
niques, §5 discusses experimental testing, and §6 concludes. 

Why study matching constraint estimation? — Practically, matching 
constraints are central to both feature grouping and 3D reconstruction, so better 
algorithms should immediately benefit many geometric vision applications. But 
there are many variations to implement, depending on the feature type, number 
of images, image projection model, camera calibration, and camera and scene 
geometry. So a systematic approach seems more appropriate than an ad hoc case- 
by-case one. Matching constraints also have a rather delicate algebraic structure 
which makes them difficult to estimate accurately. Many common camera and 
scene geometries correspond to degenerate cases whose special properties need 
to be detected and exploited for stability. Even in stable cases it is not yet clear 
how best to parametrize the constraints — usually, they belong to fairly com- 
plicated algebraic varieties and redundant or constrained parametrizations are 
required. Some numerical sophistication is needed to implement these efficiently, 
and the advantages of different models and parametrizations need to be studied 
experimentally: the library is a vehicle for this. 

It is also becoming clear that in many cases no single model suffices. One 
should rather think in terms of a continuum of nested models linked by spe- 
cialization/generalization relations. For example, rather than simply assuming a 
generic fundamental matrix, one should use inter-image homographies for small 
camera motions or large flat scenes, affine fundamental matrices for small, dis- 
tant objects, essential matrices for constant intrinsic parameters, fundamental 
matrices for wide views of large close objects, lens distortion corrections for 
real images, etc. Ideally, the model should be chosen to maximize the statisti- 
cally expected end-to-end system performance, given the observed input data. 
Although there are many specific decision criteria (ML, AIC, BIC, ... ), the 
key issue is always the bias of over-restrictive models versus the variability of 
over-general ones with superfluous parameters poorly controlled by the data. 
Any model selection approach requires several models to be fitted so that the 
best can be chosen. Some of the models must always be inappropriate — either 
biased or highly variable — so fast, reliable, accurate fitting in difficult cases is 
indispensable for practical model selection. 

Terminology and notation: We use homogeneous coordinates through- 
out, with upright bold for 3D quantities and italic bold for image ones. Image 
projections are described by 3 x 4 perspective projection matrices P, with 
specialized forms for calibrated or very distant cameras. Given m images of a 
static scene, our goal is to recover as much information as possible about the 
camera calibrations and poses, using only image measurements. We will call 
the recoverable information the inter-image geometry to emphasize that no 
explicit 3D structure is involved. The ensemble of projection matrices is
defined only up to a 3D coordinate transformation (projectivity or similarity) T:
$(P_1, \ldots, P_m) \to (P_1 T, \ldots, P_m T)$. We call such coordinate freedoms
gauge freedoms. So our first representation of the inter-image geometry is as
projection matrices modulo a transformation group. In the uncalibrated case
this gives an 11m parameter representation with 15 gauge freedoms, leaving
11m - 15 essential d.o.f. (= 7, 18, 29 for m = 2, 3, 4). In the calibrated case
there are 6m - 7 essential degrees of freedom.
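The parameter counts quoted above can be reproduced with a few lines of arithmetic. The function name `essential_dof` is a hypothetical helper introduced only for this example.

```python
def essential_dof(m, calibrated=False):
    """Essential degrees of freedom of the inter-image geometry of m images.

    Uncalibrated: 11 parameters per projection matrix minus 15 gauge freedoms
    (3D projectivity). Calibrated: 6 pose parameters per camera minus 7 gauge
    freedoms (3D similarity).
    """
    return 6 * m - 7 if calibrated else 11 * m - 15

for m in (2, 3, 4):
    print(m, essential_dof(m), essential_dof(m, calibrated=True))
```

For m = 2 the two counts are the familiar 7 d.o.f. of the fundamental matrix and 5 d.o.f. of the essential matrix.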

Any set of four (perhaps not distinct) projection matrices can be combined to
form a matching tensor [?]: a multi-image object independent of the 3D
coordinates. The possible types are: epipoles $e_i^j$; 3 x 3 fundamental matrices
$F_{ij}$; 3 x 3 x 3 trifocal tensors $G_i^{jk}$; and 3 x 3 x 3 x 3 quadrifocal
tensors $H^{ijkl}$. Their key property is that they are the coefficients of
inter-image matching constraints: the consistency relations linking
corresponding features in different images. E.g., for images $x, x', x''$ of a 3D
point we have the 2-image epipolar constraint $x^\top F x' = 0$; the 3-image
trinocular constraint, which can be written symbolically as
$[x']_\times (G \cdot x) [x'']_\times = 0$, where $[x]_\times$ is the matrix
generating the cross product $[x]_\times y = x \wedge y$; and a 4-image
quadrinocular constraint. The matching tensors also characterize the inter-image
geometry. This is attractive because they are intimately connected to the image
measurements: it is much easier to get linearized initial estimates of matching
tensors than of projection matrices. Unfortunately, this linearity is deceptive.
Matching tensors are not really linear objects: they only represent a valid,
realizable inter-image geometry if they satisfy a set of nonlinear algebraic
consistency constraints. These rapidly become intractable beyond 2-3 images,
and are still only partially understood [?]. Our second parametrization of the
inter-image geometry is as matching tensors subject to consistency constraints.

We emphasize that camera matrices or matching tensors are only a means 
to an end: it is the underlying inter-image geometry that we are really trying 
to estimate. Unfortunately, this is abstract and somewhat difficult to pin down 
because it is a nontrivial algebraic variety — there are no simple, minimal, 
global parametrizations. 

2 Optimal Geometric Fitting 

2.1 Direct Approach 

Matching constraint estimation is an instance of an abstract geometric fitting
problem which also includes curve and surface fitting and many other geometric
estimation problems: estimate the parameters of a model u defining implicit
constraints $c_i(x_i, u) = 0$ on underlying features $x_i$, from noisy
measurements of the features. More specifically we assume:

1. There are unknown true underlying features $x_i$ and an unknown true
underlying model u which exactly satisfy implicit model-feature
consistency constraints $c_i(x_i, u) = 0$. (For matching constraint
estimation, these ‘features’ are actually ensembles of several corresponding
image ones.)

2. Each underlying feature $x_i$ is linked to observations or other prior
information $\underline{x}_i$ by an additive posterior statistical error
measure $\rho_i(x_i) \equiv \rho_i(x_i | \underline{x}_i)$. For example,
$\rho_i$ might be (robustified, bias corrected) posterior log likelihood.
There may also be a model prior $\rho_{\mathrm{prior}}(u)$. These
distributions are independent.

3. The model parametrization u may itself be complex, e.g. with internal
constraints $k(u) = 0$, gauge freedoms, etc.

4. We want to find optimal consistent point estimates $(\hat{x}_i, \hat{u})$ of
the true underlying model u and features $x_i$:

    $$(\hat{x}_i, \ldots, \hat{u}) \equiv \arg\min \Big( \rho_{\mathrm{prior}}(u) + \sum_i \rho_i(x_i | \underline{x}_i) \;\Big|\; c_i(x_i, u) = 0,\; k(u) = 0 \Big)$$

Consistent means that $(\hat{x}_i, \hat{u})$ exactly satisfy all the
constraints. Optimal means that they minimize the total error over all such
estimates. Point estimate means that we are attempting to “summarize” the
joint posterior distribution $p(x_1, \ldots, u | \underline{x}_1, \ldots)$ with
just the few numbers $(\hat{x}_1, \ldots, \hat{u})$.

We call this the direct approach to geometric fitting because it involves direct
numerical optimization over the “natural” variables $(x_i, u)$. Its most
important characteristics are: (i) It gives exact, optimal results: no
approximations are involved. (ii) It produces optimal consistent estimates
$\hat{x}_i$ of the underlying features $x_i$. These are useful whenever the
measurements need to be made coherent with the model. For matching constraint
estimation such feature estimates are “pre-triangulated” or “implicitly
reconstructed” in that they have already been made exactly consistent with
exactly one reconstructed 3D feature. (iii) Natural variables are used and the
error function is relatively simple, typically just a sum of (robustified,
covariance weighted) squared deviations $\|x_i - \underline{x}_i\|^2$.
(iv) However, a sparse constrained nonlinear optimization routine is required:
the problem is large, constrained and usually nonlinear, but the features couple
only to the model, not to each other.

As an example, for the uncalibrated epipolar geometry: the “features” are
pairs of corresponding underlying image points $(x_i, x'_i)$; the “model” u is
the fundamental matrix F subject to the consistency constraint det(F) = 0; the
“model-feature constraints” are the epipolar constraints $x_i^\top F x'_i = 0$;
and the “feature error model” $\rho_i(x_i)$ might be (a robustified,
covariance-weighted variant of) the squared feature-observation distance
$\|x_i - \underline{x}_i\|^2 + \|x'_i - \underline{x}'_i\|^2$.



2.2 Reduced Approach 

If explicit estimates of the underlying features are not required, one can
attempt to replace step 4 above with an optimization over u alone:

4'. Find an optimal consistent point estimate $\hat{u}$ of the true underlying
model u:

    $$\hat{u} \equiv \arg\min \Big( \rho_{\mathrm{prior}}(u) + \sum_i \rho_i(u | \underline{x}_i) \;\Big|\; k(u) = 0 \Big)$$

Here, the reduced error functions $\rho_i(u | \underline{x}_i)$ are obtained by
freezing u and eliminating the unknown features from the problem using either:
(i) point estimates
$\hat{x}_i(\underline{x}_i, u) \equiv \arg\min_{x_i} \big( \rho_i(x_i | \underline{x}_i) \,\big|\, c_i(x_i, u) = 0 \big)$
of $x_i$ given $\underline{x}_i$ and u, with
$\rho_i(u | \underline{x}_i) \equiv \rho_i(\hat{x}_i(\underline{x}_i, u) | \underline{x}_i)$;
or (ii) marginalization with respect to $x_i$:
$\rho_i(u | \underline{x}_i) \equiv \int_{c_i(x_i, u) = 0} \rho_i(x_i | \underline{x}_i)\, dx_i$.
These two methods are not equivalent in general, although their answers happen
to agree in the linear/Gaussian limit. But both represent reasonable estimation
techniques.

We call this the reduced approach to geometric fitting, because the prob- 
lem is reduced to one involving only the model parameters u. The main advan- 
tage is that the optimization is over relatively few variables u. The constraints 
do not appear, so a non-sparse and (perhaps) unconstrained optimization routine 
can be used. The disadvantage is that the reduced cost $\rho(u)$ is seldom
available in closed form. Usually, it can only be evaluated to first order in a
linearized + central distribution approximation. In fact, the direct method
(with u frozen, and perhaps limited to a single iteration) is often the easiest
way to evaluate the point-estimate-based reduced cost. The only real difference
is that the direct method explicitly calculates and applies feature updates
$dx_i$, while the reduced method restarts each time from
$x_i = \underline{x}_i$. But the feature updates are relatively easy to calculate
given the factorizations needed for cost evaluation, so it seems a pity not to
use them.
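For a single linear scalar constraint the point-estimate-based reduced cost is available in closed form, which makes it a convenient sanity check. The sketch below is an illustration under assumptions not made in the text: the constraint is $c(x, u) = n \cdot x - d$ with $u = (n, d)$, the error measure is a Gaussian squared Mahalanobis distance with a diagonal covariance, and the names `reduced_cost_linear`, `x_obs` and `var` are hypothetical. It computes the Mahalanobis-orthogonal projection of the observation onto the constraint and compares the resulting cost with $c^2 / (n^\top \Sigma n)$.

```python
from fractions import Fraction

def reduced_cost_linear(x_obs, n, d, var):
    """Point estimate and reduced cost for the constraint n . x - d = 0.

    x_obs : observed feature (list of ints)
    n, d  : model parameters of the linear constraint
    var   : diagonal of the (assumed diagonal) covariance of x_obs
    Returns (x_hat, cost, gradient_weighted_cost).
    """
    c = sum(ni * xi for ni, xi in zip(n, x_obs)) - d        # constraint violation
    denom = sum(v * ni * ni for v, ni in zip(var, n))       # n^T Sigma n
    # Mahalanobis-orthogonal projection: x_hat = x_obs - Sigma n c / (n^T Sigma n)
    x_hat = [Fraction(xi) - Fraction(v * ni * c, denom)
             for xi, ni, v in zip(x_obs, n, var)]
    # Squared Mahalanobis distance between the estimate and the observation.
    cost = sum((xh - xi) ** 2 / Fraction(v)
               for xh, xi, v in zip(x_hat, x_obs, var))
    return x_hat, cost, Fraction(c * c, denom)

x_hat, cost, gw_cost = reduced_cost_linear([3, 1], [1, 2], 1, [2, 1])
print(x_hat, cost, gw_cost)
```

In this linear/Gaussian case the two costs coincide exactly; for nonlinear constraints the same projection is only a first-order approximation and would have to be iterated.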

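The first order reduced cost can be estimated in two ways, either (i) directly
from the definition by projecting $\underline{x}_i$ Mahalanobis-orthogonally
onto the local first-order constraint surface
$c_i + \frac{dc_i}{dx_i} \cdot dx_i = 0$; or (ii) by treating
$c_i \equiv c_i(\underline{x}_i, u)$ as a random variable, using covariance
propagation w.r.t. $\underline{x}_i$ to find its covariance, and calculating the
$\chi^2$-like variable $c_i^\top \mathrm{Cov}(c_i)^{-1} c_i$. In either case we
obtain the gradient weighted least squares cost function¹ [?]:

    $$\rho(u) = \sum_i \left. c_i^\top \left( \frac{dc_i}{dx_i} \left( \frac{d^2 \rho_i}{dx_i^2} \right)^{-1} \frac{dc_i}{dx_i}^{\!\top} \right)^{-1} c_i \,\right|_{(\underline{x}_i,\, u)}$$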

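This is simplest for problems with scalar constraints. E.g. for the uncalibrated
epipolar constraint we get the well-known form [?], with all quantities
evaluated at the observed points:

    $$\rho(u) = \sum_i \frac{(x_i^\top F x'_i)^2}{x_i^\top F\, \mathrm{Cov}(x'_i)\, F^\top x_i + x_i'^\top F^\top \mathrm{Cov}(x_i)\, F x'_i}$$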
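The gradient weighted cost for the epipolar constraint can be sketched in a few lines. The code below is a plain illustration, not the library described in the paper: it assumes the convention $x^\top F x' = 0$, homogeneous 3-vectors with unit third coordinate, and by default an isotropic unit covariance on the two inhomogeneous image coordinates (so the homogeneous covariance is diag(1, 1, 0)). The function name `epipolar_cost` and its parameters are hypothetical.

```python
def matvec(M, p):
    return [sum(M[r][c] * p[c] for c in range(3)) for r in range(3)]

def transpose(M):
    return [[M[c][r] for c in range(3)] for r in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

ISOTROPIC = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]   # assumed default covariance

def epipolar_cost(F, pairs, cov_x=ISOTROPIC, cov_xp=ISOTROPIC):
    """Sum over pairs (x, x') of (x^T F x')^2 divided by its propagated variance.

    The variance is x^T F Cov(x') F^T x + x'^T F^T Cov(x) F x'.
    """
    Ft = transpose(F)
    total = 0.0
    for x, xp in pairs:
        Fxp = matvec(F, xp)      # F x'
        Ftx = matvec(Ft, x)      # F^T x
        c = dot(x, Fxp)          # epipolar constraint violation
        var = dot(Ftx, matvec(cov_xp, Ftx)) + dot(Fxp, matvec(cov_x, Fxp))
        total += c * c / var
    return total

# Fundamental matrix of a pure horizontal translation (rectified pair):
# the constraint reduces to y' - y = 0.
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
exact = [([0, 3, 1], [7, 3, 1])]      # same image row, consistent pair
noisy = [([0, 0, 1], [5, 2, 1])]      # rows differ by 2 pixels
print(epipolar_cost(F, exact), epipolar_cost(F, noisy))
```

For the rectified example the cost of a pair whose rows differ by d pixels is d squared over 2, i.e. the squared row difference shared equally between the two images.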



2.3 Robustification — Total Distribution Approach 

Outliers are omnipresent in vision data and it is essential to protect against
them. In general, they are distinguished only by their failure to agree with the
consensus established by the inliers, so one should really think in terms of
inlier or coherence detection. The hardest part is establishing a reliable
initial estimate, i.e. the combinatorial problem of finding enough inliers to
estimate the model, without being able to tell in advance that they are inliers.
Exhaustive enumeration is usually impracticable, so one falls back on either
RANSAC-like random sampling or (in low dimensions) Hough-like voting.
Initialization from an outlier-polluted linear estimate is seldom completely
reliable.

¹ If any of the covariance matrices is singular (which happens for redundant
constraints or homogeneous data $x_i$), the matrix inverses can be replaced with
pseudo-inverses.

Among the many approaches to robustness, I prefer M-like estimators and 
particularly the total distribution approach: hypothesize a parametric form 
for the total observation distribution — i.e. including both inliers and out- 
liers — and fit this to the data using some standard criterion, e.g. maximum 
likelihood. No explicit inlier/outlier decision is needed: the correct model is
located simply because it provides an explanation more probable than randomness
for the coherence of the inliers.² The total approach is really just classical
parametric statistics with a more realistic or “robust” choice of parametric
family. Any required distribution parameters can in principle be estimated during
fitting (e.g. covariances, outlier densities). For centrally peaked mixtures one
can view
the total distribution as a kind of M-estimator, although it long predates these 
and gives a much clearer meaning to the rather arbitrary functional forms usu- 
ally adopted for them. As with other M-like-estimators, the estimation problem 
is nonlinear and numerical optimization is required. With this approach, both 
of the above geometric fitting methods are ‘naturally’ robust — we just need to 
use an appropriate total likelihood. 
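One concrete instance of the total distribution approach is a Gaussian inlier peak on a uniform outlier background. The sketch below is an assumption-laden example rather than the paper's error model: the mixture form, the parameter names (`sigma`, `outlier_frac`, `width`) and their default values are all hypothetical choices made for illustration.

```python
import math

def mixture_densities(r, sigma=1.0, outlier_frac=0.1, width=100.0):
    """Weighted inlier and outlier densities of a scalar residual r."""
    inlier = ((1.0 - outlier_frac)
              * math.exp(-0.5 * (r / sigma) ** 2)
              / (sigma * math.sqrt(2.0 * math.pi)))
    outlier = outlier_frac / width          # uniform background over `width`
    return inlier, outlier

def robust_cost(r, **kw):
    """Negative log likelihood of r under the total (mixture) distribution."""
    inlier, outlier = mixture_densities(r, **kw)
    return -math.log(inlier + outlier)

def inlier_probability(r, **kw):
    """Posterior probability that r was generated by the inlier component."""
    inlier, outlier = mixture_densities(r, **kw)
    return inlier / (inlier + outlier)

for r in (0.0, 1.0, 3.0, 10.0, 100.0):
    print(r, robust_cost(r), inlier_probability(r))
```

The cost grows quadratically for small residuals but saturates at the negative log of the outlier density for large ones, so gross outliers exert almost no pull on the fit, and the posterior inlier probability comes for free from the same two densities.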

Reasons for preferring M-like estimators over trimmed ones like RANSAC’s
consensus and rank-based ones like least median squares include: (i) to the
extent that the total distribution is realistic, the total approach is actually
the statistically optimal one; (ii) only M-like cost functions are smooth and
hence easy to optimize; (iii) the ‘soft’ transitions of M-like estimators allow
better use of weak ‘near outlier’ data, e.g. points which are relatively
uncertain owing to feature extraction problems, or “false outliers” caused by
misestimated covariances or a skewed, biased, or badly initialized model;
(iv) including an explicit covariance scale makes the results more reliable and
increases the expected breakdown point: ‘scale free’ rank based estimators
cannot tell whether the measurements they are including are “plausible” or not;
(v) all of these estimators assume an underlying ranking of errors ‘by relative
size’, and none are robust against mismodelling of this: rank based estimators
only add a little extra robustness against the likelihood vs. error size
assignment.



² If the total distribution happens to be an inlier/outlier mixture (e.g.
Gaussian peak + uniform background), posterior inlier/outlier probabilities are
easily extracted as a side effect.



3 Parametrizing the Inter-image Geometry

As discussed above, what we are really trying to estimate is the inter-image
geometry: the part of the multi-camera calibration and pose that is recoverable
from image measurements alone. However, this is described by a nontrivial
algebraic variety: it has no simple, minimal, concrete, global parametrization.
For example, the uncalibrated epipolar geometry is “the variety of all
homographic mappings between line pencils in the plane”, but it is unclear how
best to parametrize this. We will consider three general parametrization
strategies for algebraic varieties: (i) redundant parametrizations with internal
gauge freedoms; (ii) redundant parametrizations with internal constraints;
(iii) overlapping local coordinate patches. Mathematically these are all
equivalent: they only differ in relative convenience and numerical properties.
Different methods are convenient for different uses, so it is important to be
able to convert between them. Even the numerical differences are slight for
strong geometries and careful implementations, but for weak geometries there can
be significant differences.



3.1 Redundant Parametrizations with Gauge Freedom 

In many geometric problems, arbitrary choices of coordinates are required 
to reduce the problem to a concrete algebraic form. Such choices are called 
gauge freedoms — ‘gauge’ just means coordinate system. They are associated 
with an internal symmetry or coordinate transformation group and its 
representations. Formulae expressed in gauged coordinates reflect the symmetry 
by obeying well-defined transformation rules under changes of coordinates, i.e. 
by belonging to well-defined group representations. 3D Cartesian coordinates 
are a familiar example: the gauge group is the group of rigid motions, and the 
representations are (roughly speaking) Cartesian tensors. 

Common gauge freedoms include: (i) 3D projective or Euclidean coordinate
freedoms in reconstruction and projection-matrix-based camera parametrizations;
(ii) arbitrary homogeneous-projective scale factors; and (iii) choice-of-plane
freedoms in homographic parametrizations of the inter-image geometry. These
latter represent matching tensors as products of epipoles and inter-image
homographies induced by an arbitrary 3D plane. The gauge freedom is the
3 d.o.f. choice of plane. The fundamental matrix can be written as
$F \simeq [e]_\times H$, where e is the epipole and H is any inter-image
homography [?]. Redefining the 3D plane changes H to $H + e\,a^\top$ for some
image line 3-vector a. This leaves F unchanged, as do rescalings
$e \to \lambda e$, $H \to \mu H$. So there are 3 + 1 + 1 gauge freedoms in the
3 + 3 x 3 = 12 variable parametrization $F \simeq F(e, H)$, leaving the correct
12 - 5 = 7 degrees of freedom of the uncalibrated epipolar geometry. Similarly
[?], the image (1, 2, 3) trifocal tensor G can be written in terms of the
epipoles $(e', e'')$ and inter-image homographies $(H', H'')$ of image 1 in
images 2 and 3:

    $$G \simeq e' \otimes H'' - H' \otimes e'' \quad \text{with freedom} \quad \begin{pmatrix} H' \\ H'' \end{pmatrix} \to \begin{pmatrix} H' \\ H'' \end{pmatrix} + \begin{pmatrix} e' \\ e'' \end{pmatrix} a^\top$$

The gauge freedom corresponds to the choice of 3D plane and 3 scale d.o.f. (the
relative scaling of $(e', H')$ vs. $(e'', H'')$ being significant), so the
18 d.o.f. of the uncalibrated trifocal geometry are parametrized by
3 + 3 + 9 + 9 = 24 parameters modulo 3 + 1 + 1 + 1 = 6 gauge freedoms. For
calibrated cameras it is useful to place the 3D plane at infinity so that the
resulting absolute homographies are represented by 3 x 3 rotation matrices. This
gives well-known 6 and 12 parameter representations of the calibrated epipolar
and trifocal geometries, each with just one redundant scale d.o.f.:
$E \simeq [e]_\times R$, $G \simeq e' \otimes R'' - R' \otimes e''$. All of these
homography + epipole parametrizations can also be viewed as projection matrix
based ones, in a 3D frame where the first projection takes the form
$(I_{3 \times 3} \,|\, 0)$. The plane position freedom a corresponds to the
3 remaining d.o.f. of the 3D projective frame [?]. These methods seem to be a
good compromise: compared to ‘free’ projections, they reduce the number of
extraneous d.o.f. from 15 to 3. However their numerical stability does depend on
that of the key image.
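The choice-of-plane and scale gauge freedoms described above can be verified directly on small integer examples. The following sketch assumes the index layout $G_i^{jk} = e'^j H''^k_i - H'^j_i e''^k$ for the tensor product expression, and uses hypothetical function names and test values throughout; it is meant only to illustrate the invariances, not to reproduce any library code.

```python
def skew(e):
    """Cross-product matrix [e]_x, so that skew(e) y = e x y."""
    return [[0, -e[2], e[1]],
            [e[2], 0, -e[0]],
            [-e[1], e[0], 0]]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def fundamental(e, H):
    """F = [e]_x H."""
    return matmul(skew(e), H)

def shift_plane(H, e, a):
    """Change of 3D plane: H -> H + e a^T."""
    return [[H[r][c] + e[r] * a[c] for c in range(3)] for r in range(3)]

def trifocal(e1, H1, e2, H2):
    """G[i][j][k] = e'^j H''^k_i - H'^j_i e''^k (assumed index layout)."""
    return [[[e1[j] * H2[k][i] - H1[j][i] * e2[k] for k in range(3)]
             for j in range(3)] for i in range(3)]

# Hypothetical epipoles, homographies and plane-shift vector.
e1, H1 = [1, 2, 3], [[1, 0, 2], [0, 1, 1], [1, 1, 0]]
e2, H2 = [2, -1, 1], [[0, 1, 1], [2, 0, 1], [1, 1, 3]]
a = [2, -1, 4]

F = fundamental(e1, H1)
F_shifted = fundamental(e1, shift_plane(H1, e1, a))
F_scaled = fundamental([2 * x for x in e1], [[3 * x for x in row] for row in H1])
G = trifocal(e1, H1, e2, H2)
G_shifted = trifocal(e1, shift_plane(H1, e1, a), e2, shift_plane(H2, e2, a))

print(F == F_shifted, G == G_shifted, det3(F))
```

The plane shift leaves both F and G literally unchanged, while rescaling e and H only multiplies F by a constant, which is irrelevant for a homogeneous quantity. These are exactly the directions along which a gauged Jacobian is rank deficient.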

Gauged parametrizations have the following advantages: (i) they are very
natural when the inter-image geometry is derived from the 3D one; (ii) they are
close to the underlying geometry, so it is relatively easy to derive further
properties from them (projection matrices, reconstruction methods, matching
tensors); (iii) a single homogeneous coordinate system covers the whole variety;
(iv) they are numerically fairly stable. Their main disadvantage is that they
include extraneous, strictly irrelevant degrees of freedom which have no effect
at all on the residual error. Hence, gauged Jacobians are exactly rank deficient:
specially stabilized numerical methods are needed to handle them. The additional
variables and stabilization also tend to make gauged parametrizations slow.

3.2 Constrained Parametrizations 

Another way to define a variety is in terms of consistency constraints that 
“cut the variety out of” a larger, usually linear space. Any coordinate system in 
the larger space then parametrizes the variety, but this is an over-parametrization 
subject to nonlinear constraints. Points which fail to satisfy the constraints have 
no meaning in terms of the variety. Matching tensors are the most famil- 
iar example. In the 2- and 3-image cases a single fundamental matrix or trifocal 
tensor suffices to characterize the inter-image geometry. But this is a linear over- 
parametrization, subject to the tensor’s nonlinear consistency constraints — only 
so is a coherent, realizable inter-image geometry represented. Such parametriza- 
tions are valuable because they are close to the image data, and (inconsistent!) 
linear initial estimates of the tensors are easy to obtain. Their main disadvan- 
tages are: (i) the consistency conditions rapidly become complicated and non- 
obvious; (ii) the representation is only implicit — it is not immediately obvious 
how to go from the tensor to other properties of the geometry such as projec- 
tion matrices. The first problem is serious and puts severe limitations on the 
use of (ensembles of) matching tensors to represent camera geometries, even in 
transfer-type applications where explicit projection matrices are not required. 
Three images seem to be about the practical limit if a guaranteed-consistent
geometry is required, although (at the peril of a build-up of rounding error)
one can chain together a series of such three image solutions [I2UI3ID.

For the fundamental matrix the codimension is 1 and the consistency constraint
is $\det(F) = 0$; this is perhaps the simplest of all representations of the
uncalibrated epipolar geometry. For the essential matrix $E$ the codimension is 3,
spanned either by the requirement that $E$ should have two equal (which counts
for 2) and one zero singular values, or by a local choice of 3 of the 9 Demazure
constraints $\left(E E^\top - \tfrac{1}{2}\operatorname{trace}(E E^\top)\right) E = 0$. For the uncalibrated trifocal tensor
$G$ we locally need $26 - 18 = 8$ linearly independent constraints. Locally (only!)
these can be spanned by the 10 determinantal constraints $\frac{d^3}{dx^3}\det(G \cdot x) = 0$;
see |B| for several global sets. For the quadrifocal tensor $H$ the codimension
is $80 - 29 = 51$ which is locally (but almost certainly not globally) spanned by
the $3! \cdot 3 \cdot 3 = 54$ determinantal constraints $\det_{ij}(H^{ijkl}) = 0$ + permutations.
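
The essential matrix constraints are easy to verify numerically. The sketch below builds $E = [t]_\times R$ from a random rotation and translation (illustrative data, with `skew` a hypothetical helper) and evaluates the determinant, the singular values and the Demazure expression $E E^\top E - \tfrac{1}{2}\operatorname{trace}(E E^\top)\,E$.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

rng = np.random.default_rng(1)

# Random rotation: orthogonalize a random matrix, then force det(R) = +1.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q * np.sign(np.linalg.det(Q))
t = rng.standard_normal(3)

E = skew(t) @ R

det_E = np.linalg.det(E)                               # codimension-1 constraint shared with F
singular_values = np.linalg.svd(E, compute_uv=False)   # two equal, one zero
EEt = E @ E.T
demazure = EEt @ E - 0.5 * np.trace(EEt) * E           # 9 cubic constraints, all zero
```

All nine Demazure entries vanish for a valid E, although only 3 of them are locally independent, which is the constraint redundancy discussed in the text.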

Note that the redundancy and complexity of the matching tensor represen- 
tation rises rapidly as more images or calibration constraints are added. Also, 
constraint redundancy is common. Many algebraic varieties require a num- 
ber of generators greater than their codimension. Intersections of the minimal 
number of polynomials locally give the correct variety, but typically have other, 
unwanted components elsewhere in the space. Extra polynomials must be in- 
cluded to suppress these, and it rapidly becomes difficult to say which sets of 
polynomials are globally sufficient. 

3.3 Local Coordinate Patches / Minimal Parametrizations 

Both gauged and constrained parametrizations are redundant and require spe- 
cialized numerical methods. Why not simplify life by using a minimal set of 
independent parameters? — The basic problem is that no such parametriza- 
tion can cover the whole of a topologically nontrivial variety without singulari- 
ties. Minimal parametrizations are intrinsically local: to cover the whole variety 
we need several such partially overlapping ‘local coordinate patches’, and also 
code to select the appropriate patch and manage any inter-patch transitions that 
occur. This can greatly complicate the optimization loop. 

Also, although infinitely many local parametrizations exist, they are not usu- 
ally very ‘natural’ and finding one with good properties may not be easy. Ba- 
sically, starting from some ‘natural’ redundant representation, we must either 
come up with some inspired nonlinear change of variables which locally removes 
the redundancy, or algebraically eliminate variables by brute force using
consistency or gauge fixing constraints. For example, Luong et al cni guarantee
$\det(F) = 0$ by writing each row of the fundamental matrix as a linear
combination of the other two. Each parametrization fails when its two rows are
linearly dependent, but the three of them suffice to cover the whole variety. In 
more complicated situations, intuition fails and we have to fall back on alge- 
braic elimination, which rapidly leads to intractable results. Elimination-based 
parametrizations are usually highly anisotropic: they do not respect the sym- 
metries of the underlying geometry. This tends to mean that they are messy to 
implement, and numerically ill-behaved, particularly near the patch boundaries. 
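
As an illustration of such a local coordinate patch, the sketch below writes one row of a 3 × 3 matrix as a linear combination of the other two, in the spirit of the row-based parametrization mentioned above. The function name `f_from_rows` and its argument layout are assumptions made for this example, not the cited authors' code.

```python
import numpy as np

def f_from_rows(r1, r2, lam, mu, dependent_row=2):
    """Build a 3x3 matrix whose row `dependent_row` equals lam*r1 + mu*r2.

    The two free rows plus (lam, mu) give 8 numbers; one is absorbed by the
    overall scale of F, leaving the 7 d.o.f. of the epipolar geometry.
    """
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    rows = [r1, r2]
    rows.insert(dependent_row, lam * r1 + mu * r2)
    return np.vstack(rows)

# The three patches differ only in which row is the dependent one.
patches = [f_from_rows([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 0.5, -2.0, k)
           for k in range(3)]
determinants = [np.linalg.det(F) for F in patches]
```

Each patch guarantees a singular matrix by construction, but breaks down when its two free rows become linearly dependent, so an optimizer using it also needs logic to switch patches.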

The above comments apply only to algebraically derived parametrizations. 
Many of the numerical techniques for gauged or constrained problems eliminate 
redundant variables numerically to first order, using the constraint Jacobians. 
Such local parametrizations are much better behaved because they are always 
used at the centre of their valid region, and because stabilizing techniques like 
pivoting can be used. It is usually preferable to eliminate variables locally and 
numerically rather than algebraically.

4 Library Architecture and Numerical Methods

The library is designed to be modular so that different problems and approaches
are easy to implement and compare. We separate: (i) the matching geometry
type and parametrization; (ii) each contributing feature-group type,
parametrization and error model; (iii) the numerical optimization method, and its
associated linear algebra; (iv) the search controller (step acceptance and damping,
convergence tests). This decomposition puts some constraints on the types of
algorithms that can be implemented, but these do not seem to be too severe in 
practice. Modularization also greatly simplifies the implementation. 

Perhaps the most important assumption is the adoption throughout of a
“square root” or normalized residual vector based framework, and the
associated use of Gauss-Newton techniques. Normalized residual vectors are
quantities $e_i$ for which the squared norm $\|e_i\|^2$ (or more generally a
robust, nonlinear function $\rho_i(\|e_i\|^2)$) is a meaningful statistical error measure.
E.g. $e_i(x_i) = \mathrm{Cov}(\underline{x}_i)^{-1/2}\,(x_i - \underline{x}_i)$, where $\underline{x}_i$ is the measured feature. This allows a
nonlinear-least-squares-like approach. Whenever possible, we work directly with
the residual $e$ and its Jacobian $\frac{de}{dx}$ rather than with $\|e\|^2$, its gradient
$\frac{d(\|e\|^2)}{dx}$ and its Hessian
$\frac{d^2(\|e\|^2)}{dx^2} = 2\left(\frac{de}{dx}^\top \frac{de}{dx} + e^\top \frac{d^2 e}{dx^2}\right)$.
We use the Gauss-Newton approximation, i.e. we discard the second derivative
term $e^\top \frac{d^2 e}{dx^2}$ in the Hessian. This buys us simplicity (no second
derivatives are needed) and also numerical stability because
we can use stable linear least squares methods for step prediction: by
default we use QR decomposition with column pivoting of $\frac{de}{dx}$ rather than
Cholesky decomposition of the normal matrix $\frac{de}{dx}^\top \frac{de}{dx}$. This is potentially slightly
slower, but for ill-conditioned Jacobians it has much better resistance to
rounding error. (The default implementation is intended for use as a reference, so it
is deliberately rather conservative). The main disadvantage of Gauss-Newton is
that convergence may be slow if the problem has both large residual and strong
nonlinearity, i.e. if the ignored Hessian term $e^\top \frac{d^2 e}{dx^2}$ is large. However,
geometric vision problems usually have small residuals: the noise is usually much
smaller than the scale of the geometric nonlinearities.
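
A bare Gauss-Newton loop in this residual-based style might look as follows. This is a minimal sketch under stated assumptions: `np.linalg.lstsq` (an SVD-based orthogonal solver) stands in for the column-pivoted QR used by the library, there is no step damping or convergence test, and the exponential fitting problem is a made-up small-residual example.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iterations=20):
    """Undamped Gauss-Newton: repeatedly solve min ||e + J dx|| and update x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iterations):
        e = residual(x)
        J = jacobian(x)
        # Least squares directly on J; the normal matrix J^T J is never formed.
        dx, *_ = np.linalg.lstsq(J, -e, rcond=None)
        x = x + dx
    return x

# Toy problem: fit y = a * exp(b * t) to noise-free samples (zero residual at the optimum).
t = np.linspace(0.0, 1.0, 10)
y = 2.0 * np.exp(-1.5 * t)

def residual(p):
    return p[0] * np.exp(p[1] * t) - y

def jacobian(p):
    # Columns are d(residual)/da and d(residual)/db.
    return np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])

p_hat = gauss_newton(residual, jacobian, x0=[1.0, -1.0])
```

Working on the Jacobian rather than on the normal matrix avoids squaring its condition number, which is the stability argument made above.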

4.1 Numerical Methods for Gauge Freedom 

The basic numerical difficulty with gauge freedom is that because gauge motions 
represent exact redundancies that have no effect at all on the residual error, in a 
classical optimization framework there is nothing to say what they should be: the 
error gradient and Hessian in a gauge direction both vanish, so the Newton step 
is undefined. If left undamped, this leads to large gauge fluctuations which 
can destabilize the rest of the system, prevent convergence tests from operating, 
etc. There are two ways around this problem: 

1. Gauge fixing conditions break the degeneracy by adding artificial
constraints. Unless we are clever enough to choose constraints that eliminate
variables in closed form, this reduces the problem to constrained optimization.
The constraints are necessarily non-gauge-invariant, i.e. non-tensorial under the
gauge group. For example, to fix the 3D projective coordinate freedom, Hartley
0 sets $P_1 = (I_{3\times 3} \mid 0)$ and $e^\top H = 0$ where $P_2 = (H \mid e)$. Neither of these
constraints is tensorial: the results depend on the chosen image coordinates.

2. Free gauge methods (like photogrammetric free bundle ones) leave
the gauge free to drift, but ensure that it does not move too far at each step.
Typically, it is also monitored and reset “by hand” when necessary to ensure
good conditioning. The basic tools are rank deficient least squares
methods (e.g. | 2 |). These embody some form of damping to preclude large
fluctuations in near-deficient directions. The popular regularization method
minimizes $\|\text{residual}\|^2 + \lambda^2 \|\text{step size}\|^2$ for some small $\lambda > 0$, an approach that
fits very well with Levenberg-Marquardt-like search control schemes.
Alternatively, a basic solution (a solution where certain uncontrolled components
are set to zero) can be calculated from a standard pivoted QR or Cholesky
decomposition, simply by ignoring the last few (degenerate) columns. One can
also find vectors spanning the local gauge directions and treat them as
‘virtual constraints’ with zero residual, so that the gauge motion is locally zeroed.
Householder reduction, which orthogonalizes the rows of $\frac{de}{dx}$ w.r.t. the gauge
matrix by partial QR decomposition, is a nice example of this.
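
The regularization approach can be sketched on a toy problem with one exact gauge direction. The code below is an illustration only: the function name `damped_step` and the 2-variable example are assumptions, and the damped problem is solved as an augmented least squares system rather than by whatever factorization the library uses.

```python
import numpy as np

def damped_step(J, e, lam):
    """Minimize ||e + J dx||^2 + lam^2 ||dx||^2 as one augmented least squares problem."""
    n = J.shape[1]
    A = np.vstack([J, lam * np.eye(n)])
    b = np.concatenate([-e, np.zeros(n)])
    dx, *_ = np.linalg.lstsq(A, b, rcond=None)
    return dx

# Toy gauge freedom: the residual depends only on x0 - x1,
# so moving along (1, 1) never changes the error.
J = np.array([[1.0, -1.0],
              [2.0, -2.0]])
e = np.array([1.0, 2.0])
gauge = np.array([1.0, 1.0]) / np.sqrt(2.0)

dx = damped_step(J, e, lam=1e-3)
gauge_motion = gauge @ dx          # component of the step along the gauge direction
new_residual = e + J @ dx
```

The undamped normal matrix is singular here, yet the damped step is well defined: it removes the residual while leaving the gauge component of the step at zero.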



4.2 Numerical Methods for Constrained Optimization 

There are at least three ways to handle linear constraints numerically: (i)
eliminate variables using the constraint Jacobian; (ii) introduce Lagrange
multipliers and solve for these too; (iii) weighting methods treat the constraints
as heavily weighted residual errors. Each method has many variants, depending 
on the matrix factorization used, the ordering of operations, etc. As a rough 
rule of thumb, for dense problems variable elimination is the fastest and stablest 
method, but also the most complex. Lagrange multipliers are slower because 
there are more variables. Weighting is simple, but slow and inexact — stable 
orthogonal decompositions are needed as weighted problems are ill-conditioned. 

For efficiency, direct geometric fitting requires a sparse implementation — 
the features couple to the model, but not to each other. The above methods 
all extend to sparse problems, but the implementation complexity increases by 
about one order of magnitude in each case. My initial implementation uni used 
Lagrange multipliers and Cholesky decomposition, but I currently prefer a sta- 
bler, faster ‘multifrontal QR’ elimination method. There is no space for full 
details here, but it works roughly as follows (NB: the implementation orders 
the steps differently for efficiency): For each constrained system, the constraint
Jacobian $\frac{dc}{dx}$ is factorized and the results are propagated to the error Jacobian
$\frac{de}{dx}$. This eliminates the dim(c) variables best controlled by the constraints from
$\frac{de}{dx}$, leaving a ‘reduced’ dim(e) × (dim(x) − dim(c)) least squares problem. Many
factorization methods can be used for the elimination and the reduced problem.
I currently use column pivoted QR decomposition for both, which means that
the elimination step is essentially Gaussian elimination. All this is done for each
feature system. The elimination also carries the $\frac{dc}{du}$ columns into the reduced
system. The residual error of the reduced system can not be reduced by
changing $x$, but it is affected by changes in $u$ acting via these reduced $\frac{dc}{du}$ columns,
which thus give contributions to an effective reduced error Jacobian $\frac{de}{du}$ for
the model $u$. (This is the reduced geometric fitting method’s error function).
The resulting model system is reduced against any model constraints and
factorized by pivoted QR. Back-substitution through the various stages then gives
the required model update and finally the feature updates.
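
The elimination idea can be sketched for a single dense system. The code below is not the multifrontal method described above: it is a minimal null-space elimination for min ||e + J dx|| subject to c + C dx = 0, using an unpivoted QR of the transposed constraint Jacobian, with a Lagrange multiplier solve as a cross-check. The function name and the random data are illustrative assumptions.

```python
import numpy as np

def eliminate_and_solve(J, e, C, c):
    """Solve min ||e + J dx|| subject to c + C dx = 0 by variable elimination."""
    m = C.shape[0]
    Q, R = np.linalg.qr(C.T, mode="complete")   # C.T = Q @ R, Q is square
    Q1, Q2 = Q[:, :m], Q[:, m:]                 # constrained / free directions
    y1 = np.linalg.solve(R[:m, :].T, -c)        # fixes the constrained part: C dx = -c
    e_red = e + J @ (Q1 @ y1)                   # residual after applying that part
    J_red = J @ Q2                              # dim(e) x (dim(x) - dim(c)) reduced Jacobian
    y2, *_ = np.linalg.lstsq(J_red, -e_red, rcond=None)
    return Q1 @ y1 + Q2 @ y2, J_red

rng = np.random.default_rng(0)
J = rng.standard_normal((6, 4))
e = rng.standard_normal(6)
C = rng.standard_normal((2, 4))
c = rng.standard_normal(2)

dx, J_red = eliminate_and_solve(J, e, C, c)

# Cross-check with Lagrange multipliers (more unknowns, same step).
KKT = np.block([[J.T @ J, C.T],
                [C, np.zeros((2, 2))]])
rhs = np.concatenate([-J.T @ e, -c])
dx_lagrange = np.linalg.solve(KKT, rhs)[:4]
```

The reduced problem has only dim(x) − dim(c) unknowns, whereas the Lagrange system has dim(x) + dim(c), which is one reason elimination tends to be the faster of the two for dense problems.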



4.3 Search Control 

All of the above techniques are linear. For nonlinear problems they must be used 
in a loop with appropriate step damping and search control strategies. This has 
been an unexpectedly troublesome part of the implementation — there seems to 
be a lack of efficient, reliable search control heuristics for constrained
optimization. The basic problem is that the dual goals of reducing the constraint violation
and reducing the residual error often conflict, and it is difficult to find a
compromise that is good in all circumstances. Traditionally, a penalty function 0 is
used, but all such methods have a ‘stiffness’ parameter which is difficult to set 
— too weak and the constraints are violated, too strong and the motion along 
the constraints towards the cost minimum is slowed. Currently, rather than a 
strict penalty function, I use a heuristic designed to allow a reasonable amount 
of ‘slop’ during motions along the constraints. The residual/constraint conflict 
also affects step damping — the control of step length to ensure acceptable 
progress. The principle of a trust region — a dynamic local region of the search 
space where the local function approximations are thought to hold good — ap- 
plies, but interacts badly with quadratic programming based step prediction 
routines which try to satisfy the constraints exactly no matter how far away they 
are. Existing heuristics for this seemed to be poor, so I have developed a new 
‘dual control’ strategy which damps the towards-constraint and along-constraint 
parts of the step separately using two Levenberg-Marquardt parameters linked 
to the same trust region. 

Another difficulty is constraint redundancy. Many algebraic varieties re- 
quire a number of generators greater than their codimension to eliminate spuri- 
ous components elsewhere in the space. The corresponding constraint Jacobians 
theoretically have rank = codimension on the variety, but usually rank > codi- 
mension away from it. Numerically, a reasonably complete and well-conditioned 
set of generators is advisable to reduce the possibility of convergence to spurious 
solutions, but the high degree of rank degeneracy on the variety, and the rank 
transition as we approach it, are numerically troublesome. Currently, my only 
effective way to handle this is to assume known codimension r and numerically 
project out and enforce only the r strongest constraints at each iteration. This is 
straightforward to do during the constraint factorization step, once r is known. 
As examples: the trifocal point constraints $[x']_\times (G \cdot x)\,[x'']_\times = 0$ have rank
4 in $(x, x', x'')$ for most invalid tensors, but only rank 3 for valid ones; and the
trifocal consistency constraints $\det(G \cdot x) = 0$ have rank 10 for most invalid
tensors, but only rank 8 for valid ones. In both cases, overestimating the rank
causes severe ill-conditioning.
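
Keeping only the r strongest constraints can be illustrated with a small redundant system. The sketch below uses an SVD of the constraint Jacobian to pick the r dominant constraint combinations; this is a swapped-in technique chosen for brevity, not the factorization-time projection described above, and the function name and data are hypothetical.

```python
import numpy as np

def strongest_constraints(C, c, r):
    """Project the linearized constraints C dx + c = 0 onto their r strongest combinations."""
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    Ur = U[:, :r]                    # left singular vectors of the r largest singular values
    return Ur.T @ C, Ur.T @ c, s

# Three generators, but the third is almost the sum of the first two,
# so the constraint Jacobian has numerical rank 2 (known codimension r = 2).
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1e-9]])
c = np.array([0.1, -0.2, -0.1])

C_r, c_r, s = strongest_constraints(C, c, r=2)
cond_full = np.linalg.cond(C)        # enforcing all 3 rows: severely ill-conditioned
cond_reduced = np.linalg.cond(C_r)   # enforcing the 2 strongest: well-conditioned
```

Enforcing all three rows would amount to overestimating the rank, and the tiny third singular value would dominate the conditioning of the constrained step.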

4.4 Robustification 

We assume that each feature has a central robust cost function $\rho_i(x_i) =
\rho_i(\|e_i(x_i)\|^2)$ defined in terms of a covariance-weighted normalized residual
error $e_i(x_i) = e_i(x_i \mid \underline{x}_i)$. This defines the ‘granularity’: entire ‘features’ (for
matching constraints, ensembles of corresponding image features) are robustified,
not their individual components. The robust cost $\rho_i$ is usually some M-estimator,
often a total log likelihood. For a uniform-outlier-polluted Gaussian it has the
form $\rho(z) = -2\log\left(e^{-z/2} + \beta\right)$, where $\beta$ is related to outlier density. Typically,
$\rho(z)$ is linear near 0, monotonic but sublinear for $z > 0$ and tends to a constant
at $z \to \infty$ if distant outliers have vanishing influence. Hence, $\rho' = \frac{d\rho}{dz}$ decreases
monotonically to 0 and $\rho'' = \frac{d^2\rho}{dz^2}$ is negative.

Robustification can lead to numerical problems, so care is needed. Firstly,
since the cost is often nonconvex for outlying points, strong regularization may be
required to guarantee a positive Hessian and hence a cost reducing step. This can
slow convergence. To partially compensate for this curvature, and to allow us to
use a ‘naive’ Gauss-Newton step calculation while still accounting for robustness,
we define a weighted, rank-one-corrected effective residual
$\tilde{e} = \frac{\sqrt{\rho'}}{1-\alpha}\, e$ and effective Jacobian
$\frac{d\tilde{e}}{dx} = \sqrt{\rho'}\left(I - \alpha \frac{e\, e^\top}{\|e\|^2}\right)\frac{de}{dx}$, where
$\alpha = \mathrm{RootOf}\left(\tfrac{1}{2}\alpha^2 - \alpha - \frac{\rho''}{\rho'}\|e\|^2\right)$.
These definitions ensure that to second order in $\rho$ and $dx$ and up to
an irrelevant constant, the true robust cost $\rho\left(\|e + \frac{de}{dx}dx\|^2\right)$ is the same as the
naive effective squared error $\|\tilde{e} + \frac{d\tilde{e}}{dx}dx\|^2$. I.e. the same step $dx$ is generated, so
if we use effective quantities, we need think no further about robustness.¹ Here
the $\sqrt{\rho'}$ weighting is the first order correction, and the $\alpha$ terms are the second
order one. Usually $\rho' \approx 0$ for distant outliers. Since the whole feature system
is scaled by $\sqrt{\rho'}$ this might cause numerical conditioning or scaling problems
in the direct method. To avoid this, we actually apply the $\sqrt{\rho'}$-weighting at the
last possible moment (the contribution of the feature to the model error)
and leave the feature systems themselves unweighted.
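
The effective residual and Jacobian can be checked against the second order expansion of the robust cost. The sketch below assumes the uniform-outlier-polluted Gaussian cost $\rho(z) = -2\log(e^{-z/2} + \beta)$, takes the root $\alpha = 1 - \sqrt{1 + 2(\rho''/\rho')\|e\|^2}$ (the one that vanishes when $\rho'' = 0$), and clamps it below $1 - \epsilon$; the function names, the value of $\beta$ and the test data are illustrative assumptions.

```python
import numpy as np

def rho_derivs(z, beta):
    """First and second derivatives of rho(z) = -2 log(exp(-z/2) + beta)."""
    g = np.exp(-0.5 * z)
    d1 = g / (g + beta)
    d2 = -0.5 * beta * g / (g + beta) ** 2
    return d1, d2

def effective_system(e, J, beta, eps=1e-3):
    """Weighted, rank-one-corrected residual and Jacobian (requires e != 0)."""
    z = e @ e
    d1, d2 = rho_derivs(z, beta)
    # Clamp so that alpha <= 1 - eps when the robust curvature turns negative.
    disc = max(1.0 + 2.0 * (d2 / d1) * z, eps ** 2)
    alpha = 1.0 - np.sqrt(disc)
    w = np.sqrt(d1)
    e_eff = w / (1.0 - alpha) * e
    J_eff = w * (J - alpha * np.outer(e, e @ J) / z)
    return e_eff, J_eff

rng = np.random.default_rng(0)
e = np.array([0.6, 0.8])
J = rng.standard_normal((2, 3))
beta = 0.1

e_eff, J_eff = effective_system(e, J, beta)

# Gradient and Gauss-Newton Hessian of the naive effective squared error (up to a factor 2).
grad_eff = e_eff @ J_eff
hess_eff = J_eff.T @ J_eff
```

When the clamp is inactive, `grad_eff` equals $\rho'\, e^\top \frac{de}{dx}$ and `hess_eff` equals $\frac{de}{dx}^\top(\rho' I + 2\rho''\, e e^\top)\frac{de}{dx}$, which are the first and second order terms of the true robust cost with the second derivatives of $e$ dropped.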

¹ If $\frac{\rho''}{\rho'}\|e\|^2 < -\tfrac{1}{2}$ the robust Hessian has negative curvature and there is no real
solution for $\alpha$. In practice we limit $\alpha < 1 - \epsilon$ to prevent too much ill-conditioning.
We would have had to regularize this case away anyway, so nothing is lost.

5 Measuring Performance

We currently test mainly on synthetic data, to allow systematic comparisons
over a wide range of problems. We are particularly concerned with verifying
theoretical statistical performance bounds, as these are the best guarantee that
we are doing as well as could reasonably be expected. Any tendency to
return occasional outliers is suspect and needs to be investigated. Histograms of
the ground-truth-feature residual (GFR) have proven particularly useful for
this. These plot frequency vs. size of the total squared deviation of the ground
truth values of the noisy features used in the estimate, from the estimated
matching relations. This measures how consistent the estimated geometry is with the
underlying noise-free features. For weak feature sets the geometry might still
be far from the true one, but consistency is the most we can expect given the
data. In the linear approximation the GFR is $\chi^2_\nu$ distributed for any sufficient
model and number of features, where $\nu$ is the number of d.o.f. of the
underlying inter-image geometry. This makes GFR easy to test and very sensitive to
residual biases and oversized errors, as these are typically proportional to the
number of features $n$ and hence easily seen against the fixed $\chi^2_\nu$ background for
$n \gg \nu$. For example, fig. 1 shows GFR histograms for the 7 d.o.f. uncalibrated
epipolar geometry for direct and reduced F-matrix estimators and strong and
weak (1% non-coplanar) feature sets. For the strong geometry both methods
agree perfectly with the theoretical $\chi^2_7$ distribution without any sign of outliers,
so both methods do as well as could be hoped. This holds for any number of
points from 9 to 1000: the estimated geometry (error per point) becomes more
accurate, but the total GFR error stays constant. For the weak geometry both
methods do significantly worse than the theoretical limit (in fact they turn
out to have a small but roughly constant residual error per point rather than in
total), with the direct method being somewhat better than the reduced one.
We are currently investigating this: in theory it should be possible to get near
the limit, even for exactly singular geometries.

Fig. 1. Ground feature residuals for strong and near-coplanar epipolar
geometries. Panel titles: “Ground Truth Residual - 20 points - strong geometry”
and “Ground Truth Residual - 20 points - Near-Planar (1%)”.
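
The claim that the total GFR stays fixed while the number of features grows can be reproduced with a linear stand-in. The sketch below is not the epipolar estimator: it fits an ordinary linear model with $\nu$ parameters to unit-variance noisy data and measures the squared deviation of the fitted values from the noise-free ones, which in this linear setting is $\chi^2_\nu$ distributed with mean $\nu$. The function name and trial counts are arbitrary choices.

```python
import numpy as np

def mean_gfr(n, nu, trials, rng):
    """Average squared deviation of the ground-truth values from the fitted model."""
    X = rng.standard_normal((n, nu))            # design matrix, nu model d.o.f.
    beta = rng.standard_normal(nu)              # true model
    truth = X @ beta                            # noise-free measurements
    noisy = truth + rng.standard_normal((trials, n))        # unit-variance noise
    beta_hat = np.linalg.lstsq(X, noisy.T, rcond=None)[0]   # one fit per trial (nu x trials)
    fitted = (X @ beta_hat).T
    return float(np.mean(np.sum((fitted - truth) ** 2, axis=1)))

rng = np.random.default_rng(0)
mean_small = mean_gfr(n=20, nu=3, trials=4000, rng=rng)
mean_large = mean_gfr(n=200, nu=3, trials=4000, rng=rng)
```

Both averages should sit near $\nu = 3$ regardless of $n$, so any bias that grows with the number of features would stand out clearly against this fixed background.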

6 Summary 

We have described work in progress on a generic, modular library for the optimal 
nonlinear estimation of matching constraints, discussing especially the overall 
approach, parametrization and numerical optimization issues. The library will 
cover many different constraint types & parametrizations and feature types & 
error models in a uniform framework. It aims to be efficient and stable even in
near-degenerate cases, e.g. so that it can be used reliably for model selection.
Several fairly sophisticated numerical methods are included, including a sparse
constrained optimization method designed for direct geometric fitting.
Future work will concentrate mainly on (i) implementing and comparing different
constraint types and parametrizations, feature types, and numerical resolution
methods; and (ii) improving the reliability of the initialization and optimization
stages, especially in near-degenerate cases.

Abstract. The objective of this work is to automatically estimate the
trifocal tensor and feature correspondences over image triplets, under
unrestricted camera motions and changes in internal parameters between
views. To this end we extend a previous wide baseline 2-view algorithm to
3-views. The algorithm is based on establishing feature correspondences 
between views together with a homography which enables a viewpoint 
invariant affinity score. The input is the three images, and the output 
the trifocal tensor and image correspondences, which in turn facilitate a 
projective reconstruction of the cameras and scene. 

We also investigate the direct computation of the fundamental matrix 
for view pairs, and the trifocal tensor for view triplets, from the homo- 
graphies alone. This method is successful, but not as yet as accurate as 
computation from point correspondences. 

Finally, it is shown that the 3-view algorithm allows reconstruction of 
3D points, lines and cameras from a disparate set of 11 views. All the 
algorithms have been implemented and assessed on real images, and pro- 
cessing is automatic throughout. 



1 Introduction 

Over the past decade algorithms have been developed which simultaneously 
compute feature correspondences and the epipolar geometry between pairs of 
images I231I2BI; and feature correspondences and the trifocal geometry between 
triplets of images ps). These algorithms are robust and reliable and require only 
information derived from the images. In particular no information on the camera 
internal parameters or camera motion need be supplied. However, the algorithms 
require a restricted camera motion and limited change in internal parameters be- 
tween views. If these restrictions are not satisfied then the algorithms will fail. In 
previous work we have extended 2-view algorithms to enable feature matching
and geometry estimation between two quite disparate views: “Wide baseline
stereo matching” ng. We review this approach in section 2.

In this paper we extend this work in two ways: first, a matching and geometry
estimation algorithm is developed for three views. The input is three images of
a scene acquired from unrestricted viewpoints, and with unrestricted camera
internal parameters. The output is a set of corresponding point features and the
estimated trifocal tensor El CHI ES] for the view triplet. This is described in
section 3.

The second extension is an algorithm for 2-view and 3-view geometry esti- 
mation which does not require point (corner) correspondences. In this case the 
fundamental matrix and trifocal tensor are computed directly from planar ho- 
mographies. The algorithm is applicable to images where corner correspondences 
are not available, but planar homographies can be computed by other means, e.g. 
from line correspondences. This is ideally suited to piecewise planar scenes [^j. 
The algorithm is described in section g] 

Once the trifocal tensor is available for three views it supports the automatic 
matching of lines and curves m, and enables correspondences to be established 
over a large set of views by concatenating triplet matches |21 El • It is then 
possible to reconstruct scenes, of buildings for example, acquired from many dis- 
parate views — such as a set of photographs — and also determine the viewpoint 
of each image. There have been a number of recent photogrammetry applications 
where such wide baseline matching is required 0, E21 ^ ■ These applications are 
generally aimed at reconstruction from multiple views, where the baseline is 
large to improve the accuracy of reconstruction or a small number of views is 
used to cover all aspects of the object. Currently these applications are only 
partly automated, and require some or all correspondences to be supplied by 
hand. In contrast, the algorithms of this paper enable automatic computation of 
cameras and 3D point and line structure over a set of wide baseline views. This 
is demonstrated in section 0 

2 Recapitulation of the Two View Case 

2.1 Failure of Small Baseline Algorithm 

The first step of algorithms such as |23l EHl EH] is to establish a set of interest 
point correspondences across views. It is not necessary that all of these correspon- 
dences are correct, since robust estimation is used, but a significant proportion 
must be. Establishing these correspondences rests on two assumptions: first, that 
the intensity of fixed-size and orientation image neighbourhoods around images 
of a 3D point are similar. Cross-correlation on these intensity neighbourhoods 
then provides an affinity measure to disambiguate image point correspondences 
between views; second, that the image point motion is limited between views. 
This latter assumption enables a restricted search region to be defined, which 
in turn limits the number of potential matches, and thus lessens the reliance on 
the disambiguating power of the affinity measure. 

Both of these assumptions are violated when there is either a significant change
in the internal camera parameters or a significant motion between views.
Consider a camera motion consisting of a translation parallel to the image
x-axis, followed by a 90° rotation clockwise about the camera principal axis (i.e.
a rotation with axis perpendicular to the image plane). Cross-correlation of an
intensity neighbourhood will fail (because of the rotation) and the position of
an imaged point can move entirely across the image, e.g. a point on the left side
of the image will move to the top.

Philip Pritchett and Andrew Zisserman



2.2 Viewpoint Invariant Affinity Measure 

Both of these problems are corrected by using a homography (a plane projective 
transformation) to map between the images. Continuing with the above example, 
a homography which provides a 90° rotation will return the situation to the small 
baseline conditions. A correction of this type is based only on the camera motion 
and is similar to classical rectification where images are projectively warped to 
align the epipolar lines with corresponding scan-lines. This will be referred to as 
employing a global homography. 

However, global homographies will not always be sufficient since if there is 
a significant change in perspective effects between images, the correct mapping 
is a homography induced by the tangent plane of the surface at the 3D point of 
interest. This is a scene dependent homography, and several local homographies 
may well be required for a particular pair of images. 

Thus, the small baseline algorithm is augmented with a homography which
is used for two functions:

1. The homography provides the map between interest point neighbourhoods 
for the cross-correlation affinity measure. 

2. The homography is also used to transfer the point from one image to another 
in order to define the centre of the search region. A similar idea is used in 0 
for the case of dense stereo matching. 

For both functions the homography need only approximate the point-to-point 
map between the images — we are only seeking to improve the disambiguation 
power of cross-correlation and restrict the search region. 

The problem of wide baseline stereo matching is thus reduced to homography 
estimation followed by small baseline matching. A single (global) homography 
may suffice, or a set of local homographies may be required. 
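
Both functions of the homography can be illustrated on a synthetic 90° rotation. In the sketch below the second image is the first rotated with `np.rot90`, and `H` is the corresponding point map; the helper names, the random texture and the 11 × 11 neighbourhood are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

def transfer(H, pts):
    """Map (n, 2) points (x, y) through the 3x3 homography H."""
    homog = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def sample(img, pts):
    """Nearest-pixel intensities at (x, y) points; x indexes columns, y indexes rows."""
    xy = np.rint(pts).astype(int)
    return img[xy[:, 1], xy[:, 0]]

def ncc(a, b):
    """Normalized cross-correlation of two intensity vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

N = 64
rng = np.random.default_rng(0)
img1 = rng.random((N, N))
img2 = np.rot90(img1)                  # second view: the first view rotated by 90 degrees

# Homography taking (x, y) in img1 to the same scene point in img2.
H = np.array([[0.0, 1.0, 0.0],
              [-1.0, 0.0, N - 1.0],
              [0.0, 0.0, 1.0]])

p = np.array([[20.0, 30.0]])           # interest point in img1
d = np.arange(-5, 6)
offsets = np.array([(dx, dy) for dy in d for dx in d], dtype=float)

patch1 = sample(img1, p + offsets)
centre2 = transfer(H, p)                              # function 2: centre of the search region
fixed = sample(img2, centre2 + offsets)               # fixed-orientation neighbourhood
mapped = sample(img2, transfer(H, p + offsets))       # function 1: neighbourhood mapped by H

score_fixed = ncc(patch1, fixed)
score_mapped = ncc(patch1, mapped)
```

With the neighbourhood mapped through the homography the correlation is essentially perfect, whereas the fixed-orientation neighbourhood around the same transferred point correlates poorly, which is the small baseline failure described in section 2.1.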



2.3 Wide Baseline Stereo Matching Algorithm 

The algorithm consists of 3 main steps: 

1. Automatically generate a (set of local) planar homographies. 

A global homography may be computed by standard pyramid search
techniques misi. Often a similarity transformation suffices.

Local (scene-dependent) homographies may be computed using matching
techniques similar to those used in model-based recognition H3I. For
example, using feature-focus |3| to identify and match distinctive image features.
In [13] four-line groupings are matched between images. Homographies are
then computed from the corresponding lines.

2. For each homography generate a set of interest point matches.

Interest point matches are consistent with the homography if both their 
position and intensity neighbourhood are similar after mapping by the ho- 
mography. 

3. Estimate the fundamental matrix and consistent correspondences. 

Sets of matches are generated from each local homography and then com- 
bined. The fundamental matrix, F, and consistent matches are estimated 
using the RANSAC |t| robust estimator in the manner of the small baseline
algorithms |23 (the RANSAC sample consists of 7 correspondences in this
case). A maximum likelihood estimate of F and the correspondences is then
obtained via a robust non-linear minimization. 

Further details are given in ng. 
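
The sample size matters because the number of RANSAC trials grows quickly with it. The sketch below uses the standard textbook estimate (not a formula taken from this paper): with inlier fraction w and sample size s, N trials contain at least one all-inlier sample with probability $1 - (1 - w^s)^N$. The function name, the 50% inlier fraction and the 99% confidence level are assumed for illustration.

```python
import math

def ransac_trials(inlier_fraction, sample_size, confidence=0.99):
    """Smallest N with 1 - (1 - w**s)**N >= confidence, for 0 < w < 1."""
    p_good = inlier_fraction ** sample_size     # chance that one sample is all inliers
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# 7-correspondence samples, as used for the fundamental matrix above.
n_seven = ransac_trials(inlier_fraction=0.5, sample_size=7)
miss_probability = (1.0 - 0.5 ** 7) ** n_seven  # chance that every sample was contaminated
```

This is why a significant proportion of the putative matches must be correct: halving the inlier fraction multiplies the chance of an all-inlier sample of 7 by 1/128.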

3 Extension to Three Views: Trifocal Tensor Estimation 

The small baseline trifocal estimation algorithm 0I25I is extended to a wide 
baseline in a similar manner to the previous section. Point correspondences are 
here required over three views, and again homographies enable the small baseline 
algorithms to be used. 

The algorithm for simultaneously estimating the trifocal tensor, T, and
consistent 3-view point correspondences has the following steps:

1. Generate homographies: Compute a (set of) homographies between views
1 & 2, and 2 & 3 using the methods of section 2.

2. 2 view correspondences: Compute interest point correspondences (and 
the fundamental matrix) between views 1 & 2, and 2 & 3 using the wide 
baseline stereo algorithm. Each correspondence has an association with a 
homography. 

3. Putative three view correspondences: Compute a set of interest point
correspondences over three views by joining the 2-view match sets: 3-view
matches are formed from those 2-view matched points that share a common
point in one of the images.

4. RANSAC robust estimation: of T based on samples of 6 point corre- 
spondences. 

5. Maximum Likelihood Estimation (MLE): by minimizing reprojection 
error, re-estimate T, and perfectly consistent correspondences, from all the 
measured correspondences classified as inliers. 

6. Guided matching: Further interest point correspondences are determined 
using the estimated T. Putative two view matches are obtained for views 
1 & 2, and then verified using view 3. Verification involves two tests: first, 
there is a threshold on the image distance between the measured point in the 
third view, and the point transferred by T from its two view match; second, 
there is a threshold on the affinity measure computed using the homography
associated with the two-view match.

The last three steps can be iterated until the number of correspondences is stable. 
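Step 3 above (joining the 2-view match sets) can be sketched as follows. This is only an illustrative sketch: the function name `join_pair_matches` and the representation of matches as index pairs are assumptions made here, not part of the original algorithm description.

```python
def join_pair_matches(matches_12, matches_23):
    """Join 2-view matches into putative 3-view matches.

    matches_12: list of (i1, i2) corner index pairs between views 1 and 2.
    matches_23: list of (i2, i3) corner index pairs between views 2 and 3.
    A triplet is formed whenever both pairs share the same corner in view 2.
    (Hypothetical data layout, chosen for illustration only.)
    """
    view1_by_view2 = {i2: i1 for i1, i2 in matches_12}
    return [(view1_by_view2[i2], i2, i3)
            for i2, i3 in matches_23
            if i2 in view1_by_view2]


# Toy example: corners 10 and 12 of view 2 are matched in both pairs.
triplets = join_pair_matches([(0, 10), (1, 11), (2, 12)],
                             [(10, 20), (12, 22), (13, 23)])
```

The resulting triplets are only putative; the RANSAC and MLE steps are still needed to reject joins built on a wrong 2-view match.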




V 12 V 13 V 14 



Fig. 1. Valbonne images. 

3.1 Results

Table 1 shows the number of corner correspondences for various triplets of the
Valbonne image set (shown in figure 1). Typical examples of Valbonne triplets
are given in figure 2. Further examples of automatically estimating the trifocal
tensor and correspondences are shown for different scenes in figures 3-4.



Triplet     Joined Pair   Inliers   T Guided   Inliers   Matches after
            Matches       to T      Matches    to T      Iteration

3 4 5            71          60         71        64         102
4 5 6            94          90         85        72         109
5 6 7            98          92        102        81         134
6 7 8            46          42         48        39          84
7 8 9            70          56         62        48          57
8 9 10          109          75         72        43          72
9 10 11         120         106        110        86         137
10 11 12        189         172        169       144         185
11 12 13        207         187        185       150         219
12 13 14        191         164        165       135         209



Table 1. Three view matching for triplets from the Valbonne set (see figures 1
and 2). The Joined Pair Matches are the matches formed by taking points in the
second image which have been paired with matches in both the first and third
images. An estimate of T is then computed using RANSAC; the number of inliers
to this solution is given. The T Guided Matches are found using the method
described in section 3. The Matches after Iteration are found by iterating the
guided matching and re-estimation of T. See the discussion in section 3.1.



We first describe matching a triplet from the Valbonne set (figure 1) in
greater depth and then discuss the results for the entire set of triplets.



Triplet V11-V13 In order to clarify the results of Table 1 we examine each
stage of the matching process for the triplet V11, V12 and V13 (the first triplet
of figure 2).

— The Harris corner detector m is used to find 970, 970, and 976 corners 
respectively in the three images. 

— Global similarity transformations that approximately register the images are
computed between pairs V11 & V12, and V12 & V13.

— The wide baseline stereo matching algorithm with global homographies gives
372 corner matches between V11 & V12, and 362 matches between V12 &
V13.

— The join of the pair matches gives 207 triplet matches. 

— An initial estimate of T, generated from the initial triplet matches using 
RANSAC, has 187 matches as inliers. 

— Using this estimate of T, guided matching finds 185 matches, of which 150
are inliers to a second estimate of T using RANSAC.

— After further iterations of guided matching, and re-estimation of T, a final
219 matches are found.





Fig. 2. Valbonne Triplets. The trifocal tensor is estimated using corner 
matches and a global homography affinity score. Five of the matched points are 
shown together with their corresponding epipolar lines in the second and third 
images. The epipolar geometry is determined from the estimated trifocal tensor. 
The trifocal geometry is illustrated in this manner in all of the following triplet 
figures. The number of matched points for these examples is given in table 1.



D0 D6 D12



Fig. 3. Model house triplet. The trifocal tensor is computed from point 
matches supported by three local homographies for the affinity score. IfO points 
are matched over the 3 images. The local homographies are computed from 
matched line groupings. In the first, second and third views there are 25, 21
and 30 four-line groupings respectively. Three groupings are matched between
views D0 & D6, and views D6 & D12, and provide the local homographies. The
angle between views D0 and D12 is approximately 60°.



Discussion In examining table 1 horizontally, it can be seen that for all but
two of the triplets (V7-V9 and V8-V10) the number of matches produced after 
repeated iteration of the guided matching is greater than the number of matches 
produced by joining the pair match sets. However, these two triplets are the 
ones with the smallest proportion of correct joined pair matches (as indicated 
by inliers to T). Guided matching is productive because putative matches that 
have previously been eliminated as “love-triangles” UH are no longer erroneously 
excluded. 
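The claim about the proportion of correct joined pair matches can be checked directly from the first two numeric columns of Table 1. The short sketch below recomputes the inlier proportion per triplet; the dictionary layout and variable names are illustrative assumptions, the numbers are copied from the table.

```python
# (joined pair matches, inliers to T) per triplet, copied from Table 1.
table_1 = {
    "3 4 5": (71, 60),
    "4 5 6": (94, 90),
    "5 6 7": (98, 92),
    "6 7 8": (46, 42),
    "7 8 9": (70, 56),
    "8 9 10": (109, 75),
    "9 10 11": (120, 106),
    "10 11 12": (189, 172),
    "11 12 13": (207, 187),
    "12 13 14": (191, 164),
}

# Proportion of joined pair matches that are inliers to the first estimate of T.
inlier_ratio = {name: inliers / joined
                for name, (joined, inliers) in table_1.items()}

# The two triplets with the smallest proportion of correct joined matches.
lowest_two = sorted(inlier_ratio, key=inlier_ratio.get)[:2]
```

With these figures the two lowest proportions belong to the triplets 8 9 10 and 7 8 9, which are the two triplets singled out in the discussion.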

Examining the table vertically, it is evident that the number of matches 
produced between triplets is primarily related to the stability of corners between 
images. As the camera motion is increased, more corners become occluded and 
new corners are introduced which increases the likelihood of mismatches. Also, 
the matching algorithm is dependent upon the stability of the Harris corners. 
The stability of the corners is determined by the resolution at which patches 
in the world are imaged m corners found in one image are more likely to be 
lost as the camera is zoomed more significantly, or the surface viewed at an 
increasingly oblique angle. 



4 Direct Estimation of View Relations from 
Homographies 

This section describes how homographies alone can be used to compute the 
fundamental matrix and trifocal tensor. This is useful in situations where point 
correspondences are not available, but where homographies arising from world 
planes can be computed. 

K1 K2 K3



Fig. 4. Kapel (chapel) triplet. The trifocal tensor is computed using point 
matches and a global homography affinity score. 62 points are matched over the 
triplet. 



4.1 Using Planes to Compute the Fundamental Matrix 

Suppose there are two planes, $\pi_A$ and $\pi_B$, in the scene, and these planes induce
homographies, $H_A$ and $H_B$ respectively, between two views. Then the homography
$H = H_B^{-1} H_A$ is a mapping from the first image onto itself with the following
properties:

— H is a planar homology (see I2HEE]). It has a line of fixed points, which is 
the image of the intersection of the two planes, and a distinct fixed point 
which is the epipole e in the first image. 

— A planar homology H has two equal eigenvalues. The corresponding eigen- 
vectors define the line of fixed points. The eigenvector corresponding to the 
non-degenerate eigenvalue is the epipole in this case. 

The action of the homology is shown in figure 5. The fundamental matrix is
computed as (see cni) 



$$F = [H_i e]_\times H_i \quad \text{for } i = A \text{ or } B \qquad (1)$$
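The construction of this section can be illustrated numerically. The sketch below is not the authors' implementation: it builds two plane-induced homographies from an assumed synthetic camera pair (the calibration K, rotation R, translation t and the two planes are all made-up values), recovers the epipole as the eigenvector of the isolated eigenvalue of $H_B^{-1} H_A$, and forms F as in (1).

```python
import numpy as np


def skew(v):
    """3x3 matrix [v]_x such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def fundamental_from_two_planes(H_A, H_B):
    """Epipole and F from two plane-induced homographies (sketch of eq. (1))."""
    H = np.linalg.inv(H_B) @ H_A            # planar homology on the first image
    w, V = np.linalg.eig(H)
    # The epipole belongs to the eigenvalue that is isolated from the other two.
    isolation = [min(abs(w[i] - w[j]) for j in range(3) if j != i)
                 for i in range(3)]
    k = int(np.argmax(isolation))
    e = np.real(V[:, k])
    F = skew(H_A @ e) @ H_A                 # F = [H_A e]_x H_A
    return F / np.linalg.norm(F), e


# Assumed synthetic setup: first camera K[I | 0], second camera K[R | t].
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = np.deg2rad(20.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-1.0, 0.2, 0.5])


def plane_homography(n, d):
    """Homography induced by the plane n . X = d (first camera coordinates)."""
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)


H_A = plane_homography(np.array([0.0, 0.0, 1.0]), 5.0)   # a fronto-parallel wall
H_B = plane_homography(np.array([0.0, 1.0, 0.0]), 1.5)   # a ground plane
F, e = fundamental_from_two_planes(H_A, H_B)

# Compare with the true epipole (image of the second camera centre).
e_true = K @ (-R.T @ t)
cos_angle = abs(e @ e_true) / (np.linalg.norm(e) * np.linalg.norm(e_true))

# Epipolar constraint x2^T F x1 on random scene points off both planes.
rng = np.random.default_rng(1)
X = rng.uniform([-2, -1, 3], [2, 1, 8], size=(20, 3))
x1 = (K @ X.T).T
x2 = (K @ (R @ X.T + t[:, None])).T
residuals = np.abs(np.einsum('ij,jk,ik->i', x2, F, x1))
residuals /= np.linalg.norm(x1, axis=1) * np.linalg.norm(x2, axis=1)
```

With exact homographies the two remaining eigenvalues coincide. With independently estimated homographies they only agree approximately, so the choice of the isolated eigenvalue becomes a heuristic.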



4.2 Using Planes to Compute the Trifocal Tensor 

Similarly, the trifocal tensor can be estimated from two planes which induce
homographies over three views. Denote these homographies by $H^A_{12}$, $H^B_{12}$ between
the first and second views, and $H^A_{13}$, $H^B_{13}$ between the first and third.

The trifocal tensor is determined indirectly from the camera matrices. The
3 x 4 camera matrices for the three views may be written using the homography
induced by the first plane as

$$P = [I \mid 0], \quad P' = [H^A_{12} \mid e'], \quad P'' = [H^A_{13} \mid \lambda e''] \qquad (2)$$

Since the epipoles, e' and e'', can be obtained from the two view homographies,
as described above, only the scale factor $\lambda$ is unknown. This scale factor can be
computed from the homography induced by the second plane. For this homography
to be consistent with the first, the following relations must be satisfied:

$$H^B_{12} = H^A_{12} + e' v^T \qquad (3)$$

$$H^B_{13} = H^A_{13} + \lambda e'' v^T \qquad (4)$$

Fig. 5. The action of the map $H = H_B^{-1} H_A$ on a point x in the first image is to
first transfer it to x' as though it was the image of the 3D point $X_A$, and then
map it back to the first image as though it was the image of the 3D point $X_B$.
Points in the first image on the image of the intersection of the two planes will
be mapped to themselves, so are fixed points under this action. The epipole e is
also a fixed point under this map.

The 3-vector v here represents the second plane in the projective frame defined
by cameras P and P'. Once v has been found we can then compute $\lambda$. The
computational procedure is the following:

1. Solve for v using (3).

— To solve for v a scale factor $\alpha$ must be explicitly included:

$$H^B_{12} = \alpha (H^A_{12} + e' v^T)$$

— Solve for $\alpha$ by pre-multiplying both sides by $[e']_\times$, giving

$$[e']_\times H^B_{12} = \alpha [e']_\times H^A_{12}$$

— Solve for v by taking the scalar product on the left with e', giving

$$v = (H^B_{12} / \alpha - H^A_{12})^T e' / \|e'\|^2$$

2. Given v determine $\lambda$ from (4).

— This proceeds in the same manner as above with a scale factor $\beta$, giving

$$\lambda v = (H^B_{13} / \beta - H^A_{13})^T e'' / \|e''\|^2$$

3. Compute the camera matrices P, P', P''.

4. The trifocal tensor is then computed from the three camera matrices as 
described in HH. 
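Steps 1 to 3 of this procedure can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are invented here, the scale factors are solved in a least-squares sense over all matrix entries (an assumption, since the text does not say how the over-determined equations are solved), and the input matrices are random synthetic data rather than real homographies.

```python
import numpy as np


def skew(v):
    """3x3 matrix [v]_x such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def relative_scale(e, H_A, H_B):
    """Least-squares s satisfying [e]_x H_B = s [e]_x H_A."""
    a = (skew(e) @ H_A).ravel()
    b = (skew(e) @ H_B).ravel()
    return float(a @ b) / float(a @ a)


def cameras_from_plane_homographies(HA12, HB12, HA13, HB13, e2, e3):
    """Recover v, lambda and the three cameras of eq. (2)."""
    alpha = relative_scale(e2, HA12, HB12)
    v = (HB12 / alpha - HA12).T @ e2 / (e2 @ e2)          # step 1
    beta = relative_scale(e3, HA13, HB13)
    lam_v = (HB13 / beta - HA13).T @ e3 / (e3 @ e3)       # equals lambda * v
    lam = float(lam_v @ v) / float(v @ v)                 # step 2
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])         # step 3
    P2 = np.hstack([HA12, e2[:, None]])
    P3 = np.hstack([HA13, lam * e3[:, None]])
    return P1, P2, P3, v, lam


# Synthetic consistency check with made-up values.
rng = np.random.default_rng(0)
HA12, HA13 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
e2, e3, v_true = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
lam_true = 0.7
HB12 = 2.5 * (HA12 + np.outer(e2, v_true))                # unknown scale alpha
HB13 = -1.3 * (HA13 + lam_true * np.outer(e3, v_true))    # unknown scale beta

P1, P2, P3, v, lam = cameras_from_plane_homographies(
    HA12, HB12, HA13, HB13, e2, e3)
```

On noise-free input the recovery is exact. With measured homographies the relations (3) and (4) hold only approximately, so the least-squares choices above are one of several reasonable options.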



4.3 Results 

Estimation of F The homographies used to compute F are generated by using
RANSAC on the 153 wide baseline point matches to pick out the two largest
sets of points consistent with homographies from the pair D0 & D6, shown in
figure 6. The homographies, $H_A$ and $H_B$, correspond to the planes of the house
wall and ground.




D 12 D 6 



Fig. 6. For each image pair, D0 & D6, and D6 & D12, the two homographies
with most support are searched for amongst the matches resulting from the wide 
baseline stereo matching. For each pair the points contributing to the first ho- 
mography are represented by circles and their convex hull is drawn in white, and 
the points consistent with the second homography are represented by squares and 
their convex hull is drawn in black. 

Using these homographies estimated independently, the matrix $H = H_B^{-1} H_A$
has eigenvalues -1.47, -0.93, -0.99. This indicates that the matrix is close to a
homology (for which the last two eigenvalues would be exactly equal), but that
the two homographies are not exactly consistent with the epipolar geometry.
The next step is to estimate the homographies such that they are consistent; we
return to this in section 6.







Estimation of T The homographies used to estimate T are determined, in a
similar manner to above, by using RANSAC to pick out the two largest sets of
points consistent with homographies from pairs D0 & D6, and D6 & D12. These
points are shown in figure 6.

Figure 7 shows the accuracy of T by displaying the epipolar lines (derived
from the trifocal tensor) for points lying both on and off the planes.



D0 D6 D12



Fig. 7. Model house triplet. The trifocal tensor is computed from the
homographies relating the planes shown in figure 6.



5 Matching and Reconstruction over Multiple Views 

When more than three views are available, the three view matches and tensors 
can be integrated into a consistent reconstruction over all views |P . This enables 
the cameras for all views to be determined in the same coordinate system, and
for accurate 3D structure estimation by minimizing reprojection error over all
views in which a feature appears.



(a) (b)



Fig. 8. Views of recovered 3D point and line structure and cameras. There are
316 points and 92 lines: (a) a plan view of the church and the centers of cameras
computed from 11 views of the scene; (b) a close-up, including the directions of
the cameras' principal axes.




Fig. 9. Views of the piecewise planar model generated from the 3D point and
line structure shown in figure 8. The model has been texture-mapped using the
original images from the Valbonne set (figure 1).



The tensors and correspondences are determined for triplets from 11 images
of the Valbonne set (figure 1) using the algorithm of section 3. The structure
contains 316 points. Using the line matching algorithm of Schmid together
with the computed camera matrices, 92 lines are recovered. Views of the 3D
points, lines and cameras are shown in figure 8. The lines are used to create the
polygonal model shown in figure 9, which is texture mapped from the original
images.

6 Discussion 

We have demonstrated that a reconstruction of cameras and 3D scene structure
can be generated automatically from a set of disparate views. Two types of
algorithms have been engineered, one based on interest point correspondences, the
other on direct computation from planar homographies. The point based
algorithm is the more mature, and achieves the sub-pixel accuracy of small baseline
algorithms.

The direct estimation from homographies is more preliminary and standard 
problems arising in multiple view relation estimation are still to be solved. For 
example, a maximum likelihood estimation of F (or T) from two planes would 
involve minimizing the “reprojection” error between measured points and cor- 
rected points mapped under the homography, subject to the homology consis- 
tency constraints on the two homographies. 

Another problem that requires more investigation is in obtaining estimates
of F or T when homographies arising from more than two world planes are
available. In the case of estimating these 2-view and 3-view relations from point
correspondences, it is well understood how best to use more than the minimum
number of correspondences to improve the estimate of the view relation, but this
has yet to be formulated for the case of estimation from homographies.

A final extension of this work is that of estimating homographies for scene 
types where obtaining point or line correspondences may not be possible e.g. for 
a room with untextured walls. 


Abstract. A block-based disparity estimator is proposed that considers
the non-uniform spatial distribution of the estimation error inside an image
block that is mapped into another image plane using projective 2-D
transformations. For this purpose, first the error variance distribution 
inside the image block is analytically derived. The derived error variance 
distribution shows four eccentric minima. As a consequence, the pro- 
posed disparity estimator arranges the image block eccentrically around 
the picture element (pel) to be evaluated. Thus, a reduction of the es- 
timation error variance by a factor between 1.5 and 2 can be achieved 
compared to known block-based disparity estimation techniques. Since 
the error distribution shows four minima, four possible arrangements of 
the image blocks and hence four independent estimates can be made. 
A further reduction of the estimation error variance by a factor up to 
2.6 can be achieved, when the four estimates are averaged. Additionally, 
an outlier detection and removal in the set of four estimates enables an 
increased robustness. The proposed estimator is tested using both synthetic
image pairs of known disparity and real images. The expected error
reduction performance of the estimator and its increased robustness are 
verified. 



1 Introduction 

A key problem in motion and depth estimation of real objects is to find
corresponding image content in a pair of images which result from a projection of the
same object into both image planes. Object motion leads to a displacement
between corresponding image content in a consecutive pair of images of a
monoscopic image sequence [2]. On the other hand, object depth causes a disparity
between corresponding image content in a stereoscopic pair of images [3]. Since
the estimation of disparity and displacement can be reduced to the problem of
finding corresponding image content, similar estimation techniques are applied.

In order to estimate displacement for a particular picture element (pel),
block-based estimation approaches arrange a reference block of N x N pel around
the pel to be evaluated. This reference block is mapped into the other image
plane using a 2-D transformation and its luminance signal is compared to the
luminance signal in the other image. The 2-D transformation parameters specify
the displacement of the evaluated pel. They are estimated by minimising the
differences between the luminance signals.

Essentially, three different 2-D transformations for displacement estimation 
are known from literature which differ in their number of transformation param- 
eters: 

— translational transformation using 2 transformation parameter E) PI P 

— affine transformation using 6 transformation parameters PI PI p mu HD 

— bilinear transformation using 8 transformation parameters H2i 

In general, the relation between two projections of the same physical object 
into two different image planes can not be modelled perfectly by a 2-D trans- 
formation of an image block. This leads to so-called model failures and thus to 
inaccurate estimation results. In order to determine the accuracy of displacement 
estimates, Buschmann recently derived a method for determining the displace- 
ment estimation error variance m. Since the real displacement vector field is 
unknown, the real displacement vector field is modeled as a zero-mean stochastic 
AR(1) process with given power spectral density. By relating the estimation re- 
sults of the transformation parameters to the spatial displacement distribution 
inside the reference image block, an estimation error measure is derived that 
depends on the parameters of the stochastic process. 

In ^3|; Buschmann extended the analysis of the displacement estimation 
error variance. He determined the displacement estimation error not only for 
the evaluated pel but also for all pels inside the reference image block depending 
on their position inside the block. With this extension, the spatial distribution 
of the displacement estimation error variance inside the reference image blocks 
is obtained. For a numerical evaluation of the spatial distribution, Buschmann
empirically determined the parameters of the stochastic process by evaluating 
the displacement vector field of a set of videophone sequences. It turned out that 
the minimum of the displacement error is only in the center of the block in case of 
translational transformation. In case of affine and bilinear transformation, four 
eccentric minima were found. So far, the eccentric position of the minima is not 
analytically determined and not experimentally verified. All known displacement 
estimation approaches arrange the reference image block concentrically around 
the evaluated pel, where the estimation error variance is not minimal. Thus, 
the knowledge about the eccentric position of the minima is not yet used for an 
improvement of the displacement estimator. 

In this paper, Buschmann’s approach for calculating the displacement esti- 
mation error variance |E| is applied to disparity estimation and extended in 
order to improve the estimator. In case of disparity estimation, the transforma- 
tion is constrained by the epipolar geometry of the stereo camera setup. Thus, 
the number of required transformation parameters is reduced by a factor of two. 
Furthermore, the stochastic properties of disparity fields are different from those 
of displacement vector fields. This difference is caused by the specific setup of 
a stereoscopic camera that introduces a disparity offset and scales the disparity 






variance depending on the distance between the cameras. Thus, the stochastic 
process has to be adapted. 

In order to improve the block-based disparity estimator, the distribution of 
the estimation error variance is analytically derived depending on the parame- 
ters of the stochastic process. In particular, the positions of minimal estimation 
error variance inside the reference image block are determined. Depending on 
their positions, the reference image blocks are arranged eccentrically around the 
evaluated pel, such that an estimator with minimum estimation error variance 
is obtained. The decrease of the estimation error variance of the new estima- 
tor is analytically determined and verified using synthetic noisy images and real 
images. 

In Section 2 the applied transformations are introduced. The estimation error
variance distribution is specified in Section 3 for bilinear, affine and translational
transformations. The proposed estimator is introduced in Section 4. Section 5
presents experimental results, followed by conclusions in Section 6.

2 Block-Based Disparity Estimation 

The task of binocular disparity estimation in a 3-D reconstruction approach is to
find corresponding image points in a stereoscopic pair of images and compute
the corresponding 3-D point position by a triangulation of the lines of sight of both
image points. For correspondence analysis, block-based disparity estimation
approaches arrange a reference block around the pel to be evaluated. This reference
block is mapped into the other image plane using a projective 2-D transformation.
The transformation specifies the disparity between the evaluated pel and
the corresponding pel in the other image plane.

In this section, the applied block-based 2-D transformations with their
transformation parameters are defined first. Then, the estimation of transformation
parameters with a Maximum-Likelihood estimator is introduced.

2.1 Definition of Image Block Transformation 

In Figure 1 a reference block and the homologous block are shown. The vector
$p_i = (x_i, y_i)^T$ denotes the position of an image point in the local coordinate
system of the reference block. The vector $p_i' = (x_i', y_i')^T$ denotes the position
of the transformed, homologous point.

A review of recent publications shows that currently mainly three different
2-D transformations are used to approximate the real mapping of image
content from one image into another: translational, affine and bilinear
transformation. Assuming standard stereo geometry with horizontal epipolar lines, the
y-coordinate is not affected by the transformation, i.e. $y_i' = y_i$. The mapping
function of the x-coordinate is expressed by $x_i' = B^T \cdot A$. In this function the
vector A contains the transformation parameters and B describes the relation
between the local coordinates inside the block and the transformation parameters.
For the three applied 2-D transformations, the mapping functions and the
vectors A and B are given in Tab. 1.

Fig. 1. Transformation of a reference block into a homologous block in case of
horizontal epipolar lines. 



Table 1. Definition of 2-D transformations

transformation   mapping function                                  B                            A

translational    $x_i' = a_1 + x_i$                                $(1, x_i)^T$                 $(a_1, 1)^T$
affine           $x_i' = a_1 + a_2 x_i + a_3 y_i$                  $(1, x_i, y_i)^T$            $(a_1, a_2, a_3)^T$
bilinear         $x_i' = a_1 + a_2 x_i + a_3 y_i + a_4 x_i y_i$    $(1, x_i, y_i, x_i y_i)^T$   $(a_1, a_2, a_3, a_4)^T$



In the following only bilinear transformation is considered, since affine and 
translational transformation are subsets of it and can be easily extracted by 
omitting the higher order transformation parameters. 

The disparity $d(p_i)$ at position $p_i$ inside the reference block is given by the
difference between the x-coordinates of $p_i$ and $p_i'$. With the parameter vector
of the identity transformation

$$E = (0, 1, 0, 0)^T, \qquad (1)$$

which maps a point onto itself, the disparity can be written as a function of the
transformation used and its transformation parameters:

$$d(p_i) = x_i - x_i' = B^T \cdot E - B^T \cdot A = -B^T \cdot \Delta a \qquad (2)$$

The differential transformation parameter vector $\Delta a$ is given by

$$\Delta a = A - E \qquad (3)$$
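The relations (1) to (3) for the bilinear case can be written out in a few lines. The sketch below is only an illustration; the function names and the parameter values are assumptions chosen here.

```python
import numpy as np

# Parameter vector of the identity transformation, eq. (1).
E = np.array([0.0, 1.0, 0.0, 0.0])


def basis(x, y):
    """Vector B for the bilinear transformation at local position (x, y)."""
    return np.array([1.0, x, y, x * y])


def map_x(A, x, y):
    """Mapped x-coordinate x' = B^T . A (the y-coordinate is unchanged)."""
    return float(basis(x, y) @ A)


def disparity(A, x, y):
    """Disparity d = x - x' = -B^T . (A - E), eqs. (2) and (3)."""
    return float(-basis(x, y) @ (A - E))


# Made-up bilinear parameters close to the identity transformation.
A = E + np.array([0.5, 0.02, -0.01, 0.001])
d = disparity(A, 3.0, -2.0)
```

Setting the third and fourth entries of A to zero and the second to one gives the translational case, which is the sense in which the simpler transformations are subsets of the bilinear one.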

2.2 Maximum-Likelihood (ML) Disparity Estimator 

In order to estimate the disparity $d(p_i)$, the differential transformation parameters
$\Delta a$ of (3) have to be estimated. This is done by evaluating the luminance
difference between the left and right image-block IT^ . [T5I . ITCI 

$$f_d(p_i) = S_l(p_i) - S_r(p_i) = S_r(p_i') - S_r(p_i). \qquad (4)$$

With the mapping functions of table 1 and a linearisation concerning the
transformation parameters, equation (4) can be calculated to

$$f_d(p_i) = \sum_{k=1}^{m} \frac{\partial S_r(p_i)}{\partial a_k} \, \Delta a_k = H(p_i) \cdot \Delta a \qquad (5)$$

In (5) the index k denotes one of the m transformation parameters. Assuming
that the observation of the luminance signal is only disturbed by Gaussian noise,
the ML-estimation of the differential transformation parameters leads to the
method of least squares nn]. Therefore, the estimation result is given by

$$\Delta \hat{a} = (H^T H)^{-1} H^T f_d \qquad (6)$$

In $H$ and $f_d$ the $N_B$ observations inside the transformation block are collected:

$$H = \left( H^T(p_1), H^T(p_2), \ldots, H^T(p_{N_B}) \right)^T, \quad f_d = \left( f_d(p_1), f_d(p_2), \ldots, f_d(p_{N_B}) \right)^T \qquad (7)$$

With (2) and (6) the estimated disparity is given by

$$\hat{d}(p_i) = -B^T \cdot \Delta \hat{a} = -B^T \cdot (H^T H)^{-1} H^T f_d. \qquad (8)$$

In order to compensate for the linearisation errors, the estimation is car- 
ried out iteratively. The estimated differential transformation parameters are 
accumulated over all iterations. In case of large disparities, where a differential 
estimation technique fails because of the assumption of a monotonic image sig- 
nal, a robust disparity search technique CZl is applied for an initialisation of the 
disparities. 
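The iterative least-squares estimation described in this section can be sketched on a synthetic signal. Everything specific in the sketch is an assumption made for illustration: the analytic luminance signal `s_r`, its derivative, the 17x17 block, the true parameters and the fixed number of iterations. A real estimator would work on sampled images with interpolation and numerical gradients.

```python
import numpy as np

E = np.array([0.0, 1.0, 0.0, 0.0])       # identity transformation


def s_r(x, y):
    """Assumed analytic right-image luminance signal (illustration only)."""
    return np.sin(0.2 * x) + 0.3 * x + 0.5 * np.cos(0.15 * y)


def s_r_dx(x, y):
    """Horizontal derivative of the assumed luminance signal."""
    return 0.2 * np.cos(0.2 * x) + 0.3


def estimate_parameters(s_left, xs, ys, iterations=10):
    """Iterative least-squares estimate of the bilinear parameter vector A."""
    B = np.stack([np.ones_like(xs), xs, ys, xs * ys], axis=1)   # N_B x 4
    A = E.copy()
    for _ in range(iterations):
        x_mapped = B @ A
        f_d = s_left - s_r(x_mapped, ys)            # luminance differences
        H = s_r_dx(x_mapped, ys)[:, None] * B       # linearised observations
        delta_a, *_ = np.linalg.lstsq(H, f_d, rcond=None)
        A = A + delta_a                             # accumulate over iterations
    return A


# 17x17 reference block in local coordinates centred on the evaluated pel.
gx, gy = np.meshgrid(np.arange(-8, 9), np.arange(-8, 9))
xs, ys = gx.ravel().astype(float), gy.ravel().astype(float)

# Left image generated from made-up true parameters: S_l(p) = S_r(p').
A_true = E + np.array([0.5, 0.02, -0.01, 0.001])
B_all = np.stack([np.ones_like(xs), xs, ys, xs * ys], axis=1)
s_left = s_r(B_all @ A_true, ys)

A_est = estimate_parameters(s_left, xs, ys)
d_center = -(A_est - E)[0]                # disparity at the block centre (0, 0)
```

The first pass of the loop, with A equal to E, corresponds to a single application of the least-squares solution; the later passes play the role of the iterations that compensate for the linearisation error.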

3 Local Disparity Estimation Error Variance Distribution 

For a derivation of the spatial distribution of the disparity estimation error inside 
the blocks, the disparity estimation error is defined by 

$$d_e(p_i) = \hat{d}(p_i) - d(p_i). \qquad (9)$$

The stochastic properties of the disparity estimation error are described by the
error variance

$$\sigma^2_{d_e}(p_i) = E\left[ \left( d_e(p_i) - m_{d_e}(p_i) \right)^2 \right]. \qquad (10)$$

In order to compute the error variance of the estimator without knowing the
real displacement vector field, Buschmann described the real displacement vector
field using a stationary stochastic process. In particular an AR(1) process is used
which is described by its variance $\sigma_d^2$ and its autocorrelation coefficient $\rho_{dd}$. In
contrast to displacement estimation, where a zero mean vector field is expected,
an additional mean value $m_d$ is considered in case of disparity fields since the
shift between the two cameras introduces an offset. Assuming an isotropic
ACF-model, the ACF of the process is defined by

$$R_{dd}(\Delta x, \Delta y) = \sigma_d^2 \, \rho_{dd}^{\sqrt{\Delta x^2 + \Delta y^2}} + m_d^2 \qquad (11)$$
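The isotropic ACF model of the AR(1) process with a mean offset can be evaluated as in the sketch below. The functional form (variance times the autocorrelation coefficient raised to the Euclidean distance, plus the squared mean) is the reading of the model assumed here, and the numerical values are made up.

```python
import numpy as np


def acf_disparity(dx, dy, sigma_d2, rho_dd, m_d):
    """Isotropic AR(1) autocorrelation function with mean offset m_d.

    sigma_d2: variance of the disparity field
    rho_dd:   autocorrelation coefficient between neighbouring disparities
    m_d:      mean disparity (offset introduced by the camera shift)
    """
    return sigma_d2 * rho_dd ** np.hypot(dx, dy) + m_d ** 2


# Made-up parameters; rho_dd = 0.95 is the typical value quoted in the text.
r_zero = acf_disparity(0.0, 0.0, sigma_d2=4.0, rho_dd=0.95, m_d=10.0)
r_diag = acf_disparity(3.0, 4.0, sigma_d2=4.0, rho_dd=0.95, m_d=10.0)
r_axis = acf_disparity(5.0, 0.0, sigma_d2=4.0, rho_dd=0.95, m_d=10.0)
```

Because the model depends only on the distance between two positions, `r_diag` and `r_axis` coincide, which is what isotropy means here.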

With this model, Buschmann analytically derived the spatial error distribution
for translational, affine and bilinear transformation. By restricting the image
block transformations to horizontal epipolar geometry, the horizontal component
of the displacement becomes the disparity and the vertical displacement is
equal to zero. Since Buschmann assumed statistically independent displacement
components, the estimation error variance distribution of the horizontal
displacement component corresponds to that of the disparity estimate. For
translational transformation the error variance is






MN 



Rdd{Xi-Tx,yi-Tx)dTydTx (12) 

In case of the affine transformation the error variance is



(^de,dAi^yi) = + 



(13) 

and for bilinear transformation it is defined by

(14)

These integrals give the error variance distribution inside the reference image
blocks in relation to the ACF of the AR(1) process. An evaluation of the
integrals with the ACF of (11) shows that the variance of the real disparity,
$\sigma_d^2$, scales the error variance $\sigma_{d_e}^2$ linearly. Therefore, the normalized error
variance $\sigma_{d_e}^2 / \sigma_d^2$ is shown in Fig. 2 in dependence on the local coordinate
$p_i = (x_i, y_i)^T$ inside the image block. A typical autocorrelation coefficient of
$\rho_{dd} = 0.95$ is chosen. In addition to the 3-D diagrams of the spatial error
variance distribution, the height contours are shown. There, the position of the
minimum can be recognised more easily.



[Panels: translational transformation, affine transformation, bilinear transformation]



Fig. 2. Spatial disparity error variance distribution inside blocks of the size
17x17 pel. Visualised as 3-D diagram and height contours for three different
transformations (other scale for translational transformation).

The first remarkable result is that the error variance in the central position
is exactly the same for all three transformations. This means that the estimation
error variance cannot be decreased by using more sophisticated transformations,
if the transformed image block is concentrically arranged around the evaluated
pel. This seems to contradict previous investigations which showed that affine
transformation is superior to translational even in case of concentrically arranged
blocks 13- Of course, this derivation is only valid when the real disparity field
can be modeled using an AR(1) process. This applies to natural objects with
curved surfaces. In case of large planar surfaces which are arbitrarily oriented
in 3-D space, the model of an AR(1) process is not appropriate. Thus, affine
or bilinear estimation can provide more accurate estimates than translational
transformation even if the reference block is concentrically arranged around the
evaluated pel.

The second result is that there are four minima in case of affine and
bilinear transformation, which are eccentrically arranged around the center of
the block. An investigation with varying autocorrelation coefficients shows that
their position $(x_{min}, y_{min})$ is proportional to the chosen block size N and varies
slightly with $\rho_{dd}$. It varies less than 0.03N for $0.75 < \rho_{dd} < 1$, the typical range
of autocorrelation coefficients of real disparity fields. In case of a block size of
17x17 pel (N=17) the position of the minimum is $x_{min} = y_{min} = 4$ for affine
and $x_{min} = y_{min} = 5$ for bilinear transformation.

The third result is that $\rho_{dd}$ affects the error variance itself. The error variance
rises with decreasing autocorrelation $\rho_{dd}$ between neighbouring disparities. The
reason is that with weak statistical dependence between neighbouring disparities
the implicit assumption of a deterministic course of the disparity that results
from the 2-D transformation involves larger model failures.

The fourth result is that $\rho_{dd}$ also affects the ratio between the error variance
in the center, $\sigma^2_{d_e,cent}$, and the error variance in the eccentric minimum, $\sigma^2_{d_e,ecc}$.
The ratio $g_{ecc} = \sigma^2_{d_e,cent} / \sigma^2_{d_e,ecc}$ is denoted as the gain, since a corresponding gain
is expected from an estimator that arranges the reference block eccentrically
instead of concentrically. In case of affine transformation the gain varies between
1.2 and 1.35. In case of bilinear transformation a gain between 1.45 and 1.95 can
be achieved. Since the gain is higher in case of bilinear transformation, it is
applied to the new estimator.



4 Disparity Estimation with Eccentric Image Blocks 



As shown in Fig. 2, the error distribution inside the image blocks shows four
eccentric minima. This knowledge shall now be exploited for an improvement
of the estimator. In all known approaches ii 0 [iDi [ra, the image blocks are 
arranged concentrically around the evaluated pel, where the estimation error
variance is not minimal according to the analytically derived error distribution.
A new estimator is therefore proposed that arranges the image blocks
eccentrically around the evaluated pel. Since four minima exist, there are four possible
arrangements of the image block such that the evaluated pel lies in one of the
minima (Fig. 3).




Fig. 3. Four possible eccentric arrangements of the reference block around the 
pel to be evaluated 



With the eccentric arrangement of the image blocks, a gain in terms of
reducing the estimation error variance is obtained. This gain depends on the ratio
between the error variance in the center, $\sigma^2_{d_e,cent}$, and the error variance in the
minima, $\sigma^2_{d_e,ecc}$, and is expressed by $g_{ecc} = \sigma^2_{d_e,cent} / \sigma^2_{d_e,ecc}$.

A further improvement is expected by estimating with all four possible
arrangements of the image block. Thus, the proposed estimator averages the four
possible estimates arithmetically:

$$d_{res} = \frac{1}{4} (d_1 + d_2 + d_3 + d_4) \qquad (15)$$

Assuming that the error involved by each observed pel in the image block is
uncorrelated and has the same statistical properties, the estimation error
variance decreases reciprocally with the number of observed pels when using a
Maximum-Likelihood (ML) estimator. Since the four eccentric image blocks
partly overlap, the number of observed pels is not increased by a factor of 4.
The computation of the number of observed pels, derived in Fig. 4, results
in 2.64 times more observed pels. Thus a maximum gain due to averaging of
$g_{av} \le 2.64$ is expected from the new estimator.
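
The area relation of Fig. 4, $A_4 / A_1 = (b_1 + 2s)^2 / b_1^2$, can be checked numerically with the short sketch below. The function names are ours, and treating the shift $s$ as a free real-valued parameter is an assumption made for illustration.

```python
import math


def observed_pel_factor(b1, s):
    """Area ratio A4 / A1 = (b1 + 2*s)**2 / b1**2 taken from Fig. 4."""
    return (b1 + 2 * s) ** 2 / b1 ** 2


def shift_for_factor(b1, factor):
    """Invert the relation: the shift s that yields a given area factor."""
    return b1 * (math.sqrt(factor) - 1) / 2
```

With $b_1 = 17$ pel, the reported factor of 2.64 corresponds to a shift of roughly 5.3 pel, while an integer shift of 5 pel gives a factor of about 2.52.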

The same gain cannot be expected from a single estimation with a block
size $b_4$, because a single estimation with larger blocks leads to additional model
failures that decrease the accuracy of the estimator. In case of the new estimator
the block size of each individual estimation is not changed and thus no additional
model failures are caused.

Besides arithmetically averaging the four estimates, a more appropriate
averaging might be used that leads to more robust estimates.
Such improvements are envisaged as a next step, but have not been investigated
so far.

Fig. 4. Relation between the number of observation points in a single block of
size $b_1$ and four partly overlapping blocks of size $b_1$ which cover the same area
as a single block of size $b_4$. The shift between the center of the block and the
position of minimal estimation error is denoted by $s$. Resulting areas $A$ of
evaluated pels: $A_1 = b_1^2$ and $A_4 = b_4^2 = (b_1 + 2s)^2 = 2.64 \cdot b_1^2$.



5 Experimental Results 



In this section the theoretically derived error variance distribution and the gain 
obtained by the proposed estimator are experimentally verified. For a numer- 
ical verification of the error variance distribution and the derived gain of the 
new estimator, the real disparity of the evaluated image pair has to be known. 
Therefore, two synthetic image pairs were created by rendering two virtual 3-D 
objects. In order to consider the camera noise, an additional Gaussian noise was 
added to the synthetic images. Since the real disparity is known, the disparity 
estimation error can be measured. 

In Fig. 5 the left luminance image and the real disparity field of a synthetic
image pair called SynAR are shown.




Fig. 5. Synthetic image SynAR: a) synthesised luminance image, b) real disparity field
A lowpass filtered random noise is chosen for the texture of the SynAR
object. The surface of the object approximates a depth map which results from
a stationary AR(1) process. Thus the disparity estimation results should confirm
the analytically derived disparity estimation error variance distribution. The real
disparity field has an autocorrelation coefficient of $\rho_{dd} = 0.995$.
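
To make the AR(1) model concrete, the following one-dimensional sketch generates a row of a stationary AR(1) signal with a non-zero mean and measures its lag-1 autocorrelation. It is an assumption-laden illustration: the paper's surface is two-dimensional and its generation procedure is not given, and the function names are ours.

```python
import math
import random


def ar1_disparity_row(n, rho, sigma, mean, seed=0):
    """Generate n samples of a stationary AR(1) process with the given
    autocorrelation coefficient rho, standard deviation sigma and mean."""
    rng = random.Random(seed)
    # The innovation variance sigma^2 * (1 - rho^2) keeps the process variance at sigma^2.
    innovation_std = sigma * math.sqrt(1.0 - rho * rho)
    d = [mean + rng.gauss(0.0, sigma)]  # start in the stationary distribution
    for _ in range(n - 1):
        d.append(mean + rho * (d[-1] - mean) + rng.gauss(0.0, innovation_std))
    return d


def lag1_autocorrelation(d):
    """Empirical autocorrelation coefficient between neighbouring samples."""
    m = sum(d) / len(d)
    num = sum((a - m) * (b - m) for a, b in zip(d, d[1:]))
    den = sum((a - m) ** 2 for a in d)
    return num / den
```

For a sufficiently long row, `lag1_autocorrelation` should come close to the `rho` that was used for generation.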

The second image pair was created from a virtual 3-D object that is
reconstructed from a real video communication scene. It is called SynLudo and is
shown in Fig. 6. The real disparity field of SynLudo shows discontinuities and is
not stationary. Therefore the AR(1) process cannot approximate it as accurately
as in case of SynAR. The real disparity field has an autocorrelation coefficient
of $\rho_{dd} = 0.98$.




a) synthesised luminance image b) real disparity field 

Fig. 6. Synthetic image SynLudo 



A block size of 17x17 pel was chosen for the experimental verification.
Table 2 shows the estimation error variances of the different estimators.



Table 2. Measured estimation error variances of the different estimators

Estimation                                                            SynAR           SynLudo
One concentrically arranged block: $\sigma^2_{de,cen}$                0.4803 pel$^2$  0.1214 pel$^2$
One eccentrically arranged block: $\sigma^2_{de,ecc}$                 0.3096 pel$^2$  0.0901 pel$^2$
Four averaged, eccentrically arranged blocks: $\sigma^2_{de,av}$      0.2070 pel$^2$  0.0698 pel$^2$



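The gain of the proposed estimator is computed from the measured error
variances. In case of SynAR the following gains are obtained:

$$g_{ecc} = \frac{\sigma^2_{de,cen}}{\sigma^2_{de,ecc}} = \frac{0.4803}{0.3096} = 1.5514, \qquad g_{av} = \frac{\sigma^2_{de,ecc}}{\sigma^2_{de,av}} = \frac{0.3096}{0.2070} = 1.4967$$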

In case of SynLudo the gains are

$$g_{ecc} = \frac{\sigma^2_{de,cen}}{\sigma^2_{de,ecc}} = \frac{0.1214}{0.0901} = 1.3474, \qquad g_{av} = \frac{\sigma^2_{de,ecc}}{\sigma^2_{de,av}} = \frac{0.0901}{0.0698} = 1.2908$$



Fig. 7 shows the comparison between the theoretically derived and
experimentally found disparity estimation error variance distribution.




Fig. 7. Comparison between the theoretically derived disparity estimation error
variance distribution inside the image blocks and the experimentally found one
(panels: theoretically derived, SynAR, SynLudo).
Visualised as 3-D diagram and height contours.



The theoretically derived and the measured error distributions are quite
similar. The measured position of the minima almost matches the analytically
derived one. But there is a difference between the expected error variance ratio
of $g_{ecc} = 1.93$ and the measured one of $g_{ecc} = 1.55$ in case of SynAR. This
difference is caused by the restricted accuracy of the virtual 3-D object and by errors
during the image synthesis. On the one hand, the object surface is represented
by a triangular mesh causing piecewise linear surfaces that differ from an ideal
AR(1) process. On the other hand, the applied image synthesis approach just
uses bilinear texture interpolation and performs no anti-aliasing filtering. In case of
SynLudo a gain of $g_{ecc} = 1.35$ is achieved.

From averaging, a maximum gain of $g_{av} = 2.64$ can be expected. It turns out
that only a gain of $g_{av} \approx 1.5$ is achieved during the experimental investigations.
This indicates that the error between neighbouring pels is correlated. The total
gain of the presented estimator is therefore $g = g_{ecc} \cdot g_{av} = 1.55 \cdot 1.5 = 2.33$.
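
The gain figures can be recomputed from the entries of Table 2 with the sketch below. The dictionary layout and the key `"av"` for the averaged estimator are our own naming. Since $g_{ecc} \cdot g_{av}$ telescopes to the ratio of the concentric variance to the averaged variance, the total gain can also be read directly from the first and last table rows. Values recomputed from the rounded table entries may differ in the last digits from those quoted in the text.

```python
# Measured error variances in pel^2, copied from Table 2.
variances = {
    "SynAR": {"cen": 0.4803, "ecc": 0.3096, "av": 0.2070},
    "SynLudo": {"cen": 0.1214, "ecc": 0.0901, "av": 0.0698},
}


def gains(v):
    """Return (g_ecc, g_av, g_total) for one row set of Table 2."""
    g_ecc = v["cen"] / v["ecc"]  # gain of the eccentric arrangement
    g_av = v["ecc"] / v["av"]    # additional gain of averaging four estimates
    return g_ecc, g_av, g_ecc * g_av


for name, v in variances.items():
    g_ecc, g_av, g_total = gains(v)
    print(f"{name}: g_ecc={g_ecc:.4f} g_av={g_av:.4f} g={g_total:.4f}")
```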

With real images, a subjective assessment of the estimates is carried out by 
visualising the 3-D cloud of points which results from a triangulation of corre- 
sponding image points. It turns out that the robustness is obviously improved 
due to averaging the four estimates using a simple outlier detection and removal 
that evaluates the empirical variance of the four estimates. 
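
The paper does not specify the outlier rule, so the following is only one plausible reading of "a simple outlier detection and removal that evaluates the empirical variance of the four estimates". The threshold `max_variance` and the choice to discard exactly one estimate are hypothetical.

```python
def robust_average(estimates, max_variance):
    """Average the estimates; if their empirical variance exceeds
    max_variance, drop the estimate farthest from the mean first."""
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
    if var <= max_variance:
        return mean
    # Discard the single estimate farthest from the mean, then re-average.
    outlier = max(estimates, key=lambda e: abs(e - mean))
    kept = list(estimates)
    kept.remove(outlier)
    return sum(kept) / len(kept)
```

When the four estimates agree, this reduces to the arithmetic mean of Eq. (15).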

6 Conclusions 

A block-based disparity estimator is proposed in this paper that considers the 
non-uniform spatial distribution of the disparity estimation error inside an image 
block. 

The presented approach is based on Buschmann's analytical derivation of
the spatial distribution of the displacement estimation error inside image blocks
which are mapped into another image plane using a 2-D transformation.
By restricting the image block transformations to horizontal epipolar geometry,
the horizontal component of the displacement becomes the disparity and the
vertical displacement is equal to zero. Since Buschmann assumed statistically
independent displacement components, the estimation error variance
distribution of the horizontal displacement component also corresponds to the one of the
disparity estimate.

For the derivation of the estimation error variance, the stochastic process for
modelling displacement vector fields is adapted to model disparity fields.
The zero-mean AR(1) process is extended by an additional mean value which
models the offset in disparity fields that originates from the stereoscopic camera
setup. The real disparity field is thus described by its autocorrelation coefficient
$\rho_{dd}$, its variance $\sigma_d^2$ and its mean value $m_d$.

The dependency of the spatial distribution of the disparity estimation error
variance inside the reference block on the parameters of the AR(1) process is
derived in a next step. It turns out that $m_d$ does not influence the distribution
and that $\sigma_d^2$ scales the error variance linearly. The error distribution depends
non-linearly on $\rho_{dd}$. Comparing the three investigated transformations, it is
shown that the estimation error variance in the center of the blocks is identical
for translational, affine and bilinear transformations. This observation seems to
contradict previous investigations which showed that affine transformation is
superior to translational transformation in case of almost planar surfaces and
an estimation technique that uses concentrically arranged image blocks [Z]. The
reason for this contradiction is that the disparity field of planar surfaces, having a
deterministic linear course, cannot be modelled by an AR(1) process. An AR(1)
process rather models disparity fields caused by more complex surfaces.

The position of the four minima is analytically determined for affine and
bilinear transformation. It is proportional to the chosen block size $N$ and varies
slightly with $\rho_{dd}$: it varies by less than $0.03N$ for $0.75 < \rho_{dd} < 1$. In case of
bilinear transformation the position is around (±5, ±5) and in case of affine
transformation around (±4, ±4) with respect to a local coordinate system in the
center of the block and a block size of 17x17 pel.

Based on the evaluation of the estimation error distribution, a new estimator 
is proposed. It arranges the image block eccentrically around the evaluated pel 
such that it lies in one of the minima of the estimation error variance distribution. 
Since the position of the minima is almost constant, the arrangement does not
depend on the statistical properties of the disparity field.

The gain of the proposed estimator compared to conventional estimation
with a concentrically arranged block is expressed by a gain factor $g$. The gain
obtained by eccentrically arranging the blocks is expressed by the ratio between
the error variance in the center of the block, $\sigma^2_{de,cen}$, and the error variance in the
minimum, $\sigma^2_{de,ecc}$, leading to $g_{ecc} = \sigma^2_{de,cen} / \sigma^2_{de,ecc}$. This gain depends
non-linearly on $\rho_{dd}$. For common autocorrelation values around $\rho_{dd} = 0.995$, a gain
around $g_{ecc} = 1.93$ is obtained in case of bilinear transformation and a gain
around $g_{ecc} = 1.83$ is achieved in case of affine transformation. Since the error
variance is smaller in case of bilinear transformation, bilinear transformation is
applied in the proposed estimator.

The estimation error variance is further decreased by arithmetically averaging
the four possible estimates. Assuming that the error of each observed pel is
uncorrelated, the error variance decreases reciprocally with the total number
of pels in the reference blocks when using a Maximum-Likelihood estimator.
Because the eccentric blocks partly overlap, the number of pels increases by
a factor of 2.64. Accordingly, a maximum additional gain of $g_{av} = 2.64$ can be
obtained by averaging the four estimates. The same gain cannot be obtained
by using a single block with the same number of pels, since this would cause
additional model failures.

The proposed estimator is tested in order to verify the analytically derived
gain using synthetic and real images and a constant block size of $N = 17$ pel.
Synthetic images are created by rendering two virtual 3-D objects: one with
a surface whose depth is created using an AR(1) process, called SynAR, and
another object that was reconstructed from a video communication scene, called
SynLudo. The disparity fields of both objects cannot be modelled correctly by
the investigated transformations.

The analytically determined and the measured error distributions are quite
similar. The measured position of the minima almost matches the analytically
derived one. But there is a difference between the expected error variance ratio
of $g_{ecc} = 1.93$ and the measured one of $g_{ecc} = 1.55$ in case of SynAR. This
difference is caused by the restricted accuracy of the virtual 3-D object and by errors
during the image synthesis. On the one hand, the object surface is represented
by a triangular mesh causing piecewise linear surfaces that differ from an ideal
AR(1) process. On the other hand, the applied image synthesis approach just
uses bilinear texture interpolation and performs no anti-aliasing filtering. In case of
SynLudo a gain of $g_{ecc} = 1.35$ is achieved.

For a further reduction of the estimation error due to averaging, a maximum
value of $g_{av} = 2.64$ can be expected. It turns out that only a gain of $g_{av} = 1.5$ is
achieved. This indicates that the error between neighbouring pels is correlated.
The total gain of the presented estimator is therefore $g = g_{ecc} \cdot g_{av} = 2.33$.

With real images a subjective assessment of the estimates is carried out. 
It turns out that the robustness is obviously improved due to averaging the 
four estimates using a simple outlier detection and removal that evaluates the 
empirical variance of the four estimates. Further improvements are expected,
but not yet verified, from a more appropriate averaging of the estimates that considers
disparity discontinuities and individual disparity estimation errors.

Beyond the Epipolar Constraint: Integrating 3D
Motion and Structure Estimation

Abstract. This paper develops a novel solution to the problem of recovering
the structure of a scene given an uncalibrated video sequence
depicting the scene. The essence of the technique lies in a method for 
recovering the rigid transformation between the different views in the im- 
age sequence. Knowledge of this 3D motion allows for self-calibration and 
for subsequent recovery of 3D structure. The introduced method breaks 
away from applying only the traditionally used epipolar constraint and 
introduces a new constraint based on the interaction between 3D motion 
and shape. 

Up to now, structure from motion algorithms proceeded in two well de- 
fined steps, where the first and most important step is recovering the rigid 
transformation between two views, and the subsequent step is using this 
transformation to compute the structure of the scene in view. Here both 
aforementioned steps are accomplished in a synergistic manner. Existing 
approaches to 3D motion estimation are mostly based on the use of optic 
flow which however poses a problem at the locations of depth disconti- 
nuities. If we knew where depth discontinuities were, we could (using a 
multitude of approaches based on smoothness constraints) estimate ac- 
curately flow values for image patches corresponding to smooth scene 
patches; but to know the discontinuities requires solving the structure 
from motion problem first. In the past this dilemma has been addressed 
by improving the estimation of flow through sophisticated optimization 
techniques, whose performance often depends on the scene in view. In 
this paper the main idea is based on the interaction between 3D motion 
and shape which allows us to estimate the 3D motion while at the same 
time segmenting the scene. If we use a wrong 3D motion estimate to com- 
pute depth, then we obtain a distorted version of the depth function. The 
distortion, however, is such that the worse the motion estimate, the more 
likely we are to obtain depth estimates that are locally unsmooth, i.e., 
they vary more than the correct ones. Since local variability of depth is 
due either to the existence of a discontinuity or to a wrong 3D motion es- 
timate, being able to differentiate between these two cases provides the 
correct motion, which yields the “smoothest” estimated depth as well 
as the image locations of scene discontinuities. Although no optic flow 
values are computed, we show that our algorithm is very much related 
to minimizing the epipolar constraint when the scene in view is smooth. 
When, however, the imaged scene is not smooth, the introduced constraint
has in general different properties from the epipolar constraint and we 
present experimental results with real sequences where it performs bet- 
ter. 



Keywords: Structure from motion, 3D motion estimation, shape segmentation, 
epipolar constraint, self-calibration 



1 Introduction and Motivation 

One of the biggest challenges of contemporary computer vision is to create robust 
and automatic procedures for recovering the structure of a scene given multiple 
views. This is the well known problem of structure from motion (SFM).
Here the problem is treated in the differential sense, that is, assuming that a 
camera moving in an unrestricted rigid manner in a static environment contin- 
uously takes images. Regardless of particular approaches, the solution always 
proceeds in two steps: first, the rigid motion between the views is recovered and, 
second, the motion estimate is used to recover the scene structure. 

Traditionally, the problem has been treated by first finding the correspon- 
dence or optic flow and then optimizing an error criterion based on the epipolar 
constraint. Although considerable progress has been made in minimizing deviation
from the epipolar constraint, the approach is based on the values
of flow, whose estimation is an ill-posed problem.

The values of flow are obtained by applying some sort of smoothing to the 
locally computed image derivatives. When smoothing is done in an image patch 
corresponding to a smooth scene patch, accurate flow values are obtained. When, 
however, the patch corresponds to a scene patch containing a depth discontinuity, 
the smoothing leads to erroneous flow estimates there. This can only be avoided 
if a priori knowledge about the locations of depth discontinuities is available. 
Thus, flow values close to discontinuities often contain errors (and these affect 
the flow values elsewhere) and when the estimated 3D motion (containing errors) 
is used to recover depth, it is unavoidable that an erroneous scene structure will 
be computed. The situation presents itself as a chicken-and-egg problem. If we 
had information about the location of the discontinuities, then we would be able 
to compute accurate flow and subsequently accurate 3D motion. Accurate 3D 
motion implies, in turn, accurate location of the discontinuities and estimation 
of scene structure. Thus 3D motion and scene discontinuities are inherently re- 
lated through the values of image flow and the one needs the other to be better 
estimated. Researchers avoid this problem by attempting to first estimate flow 
using sophisticated optimization procedures that could account for discontinu- 
ities, and although such techniques provide better estimates, their performance 
often depends on the scene in view, they are in general very slow, and they require a
large amount of resources.

In this paper, instead of attempting to estimate flow at all costs before proceeding
with structure from motion, we ask a different question: Would it be
possible to utilize any available local image motion information, such as normal 
flow for example, in order to obtain knowledge about scene discontinuities which 
would allow better estimation of 3D motion? Or, equivalently, would it be possi- 
ble to devise a procedure that estimates scene discontinuities while at the same 
time estimating 3D motion? We show here that this is the case and we present a 
novel algorithm for 3D motion estimation. The idea behind our approach is based 
on the interaction between 3D motion and scene structure that only recently has 
been formalized. If we have a 3D motion estimate which is wrong and we use
it to estimate depth, then we obtain a distorted version of the depth function. 
Not only do incorrect estimates of motion parameters lead to incorrect depth es- 
timates, but the distortion is such that the worse the motion estimate, the more 
likely we are to obtain depth estimates that locally vary much more than the 
correct ones. The correct motion then yields the “smoothest” estimated depth 
and we can define a measure whose minimization yields the correct egomotion 
parameters. The measure can be computed from normal flow only, so the com- 
putation of optical flow is not needed by the algorithm. Intuitively, the proposed 
algorithm proceeds as follows: first, the image is divided into small patches and
a search for the 3D motion is performed; as explained in Section 3, this search takes place
in the two-dimensional space of translation directions. For each
candidate 3D motion, using the local normal flow measurements in each patch, 
the depth of the scene corresponding to the patch is computed. If the variation 
of depth for all patches is small, then the candidate 3D motion is close to the 
correct one. If, however, there is a significant variation of depth in a patch, this 
is either because the candidate 3D motion is inaccurate or because there is a 
discontinuity in the patch. The second situation is differentiated from the first 
if the distribution of the depth values inside the patch is bimodal with the two 
classes of values spatially separated. In such a case the patch is subdivided into 
two new ones and the process is repeated. When the depth values computed in 
each patch are smooth functions, the corresponding motion is the correct one 
and the procedure has at the same time given rise to the location of a number 
of discontinuities. The rest of the paper formalizes these ideas and presents a 
number of experimental results. 



1.1 Organization of the Paper 

Section 2 defines the imaging model and describes the equations of the motion
field induced by rigid motion; it also makes explicit the relationship between
distortion of depth and errors in 3D motion. Section 3 is devoted to the description
of the algorithm; it also analyzes the introduced constraints and formalizes
the relationship of the approach to algorithms utilizing the epipolar constraint.
Section 4 reviews the self-calibration method used in the paper and Section 5
describes a number of experimental results with real image sequences.

2 Preliminaries



We consider an observer moving rigidly in a static environment. The camera is
a standard uncalibrated pinhole with internal calibration parameters described
by matrix $K$,

$$K = \begin{pmatrix} f_x & s & \Delta_x \\ 0 & f_y & \Delta_y \\ 0 & 0 & f \end{pmatrix}$$

The coordinate system OXY Z is attached to the camera, with Z being the 
optical axis. 

Image points are represented as vectors $\mathbf{r} = [x, y, f]^T$, where $x$ and $y$ are the
image coordinates of the point and $f$ is a chosen constant of the same magnitude
as $x$, $y$. A scene point $\mathbf{R}$ is projected onto the image point

$$\mathbf{r} = \frac{K\mathbf{R}}{\mathbf{R} \cdot \mathbf{z}} \qquad (1)$$



where z is the unit vector in the direction of the Z axis. 
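
A minimal sketch of the projection (1) is given below. The function name `project` and the numeric calibration values in the usage note are made up for illustration. Because the last row of $K$ is $(0, 0, f)$, the third component of the projected point is always $f$.

```python
def project(K, R):
    """Project the scene point R with Eq. (1): r = K R / (R . z), z = (0, 0, 1)."""
    depth = R[2]  # R . z, the depth along the optical axis
    KR = [sum(K[i][j] * R[j] for j in range(3)) for i in range(3)]
    return [c / depth for c in KR]
```

For example, with the hypothetical matrix `K = [[800, 0.5, 320], [0, 820, 240], [0, 0, 500]]`, the point `[1, 2, 4]` maps to an image vector whose last entry equals `500`.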

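Let the camera move in a static environment with instantaneous translation $\mathbf{t}$
and instantaneous rotation $\boldsymbol{\omega}$ (measured in the coordinate system OXYZ). Then
a scene point $\mathbf{R}$ moves with velocity (relative to the camera)

$$\dot{\mathbf{R}} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R} \qquad (2)$$

The image motion field is then [3]

$$\dot{\mathbf{r}} = -\frac{1}{f\,(\mathbf{R} \cdot \mathbf{z})}\,\mathbf{z} \times (K\mathbf{t} \times \mathbf{r}) + \frac{1}{f}\,\mathbf{z} \times \big(\mathbf{r} \times (K[\boldsymbol{\omega}]_\times K^{-1}\mathbf{r})\big) = \frac{1}{Z}\,\mathbf{u}_{tr}(K\mathbf{t}) + \mathbf{u}_{rot}(K[\boldsymbol{\omega}]_\times K^{-1}) \qquad (3)$$

where $Z$ is used to denote the scene depth $(\mathbf{R} \cdot \mathbf{z})$, and $\mathbf{u}_{tr}$, $\mathbf{u}_{rot}$ the direction of
the translational and the rotational flow respectively.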

The rotational component $\mathbf{u}_{rot}$ is determined by the matrix $A = K[\boldsymbol{\omega}]_\times K^{-1}$
with seven degrees of freedom. It can be shown that, for a given translation $\hat{\mathbf{t}}$,
matrix $A$ can be decomposed into

$$A = A_c + A_t = A_c + f\,\hat{\mathbf{t}}\,\mathbf{w}^T + w_0 I \qquad (4)$$

Matrix $A_c$ (also called copoint matrix) depends on five independent parameters
and is the component of $A$ that can be estimated (together with the direction
of $K\mathbf{t}$) from a single flow field. The vector $\mathbf{w}$ determines the plane at infinity
and it cannot be obtained from a single flow field. Finally, $w_0$ can be computed
from the condition $\operatorname{trace} A = 0$.

For a hypothesized epipole $\hat{\mathbf{t}}$, the copoint matrix $A_c$ is a suitable parameterization
of the remaining parameters of the instantaneous fundamental matrix;
$\hat{\mathbf{t}}$ and $A_c$ define the epipolar geometry of the instantaneous motion.

Also, due to linearity, we have:

$$\mathbf{u}_{rot}(A) = \mathbf{u}_{rot}(A_c) + \mathbf{u}_{rot}(f\,\hat{\mathbf{t}}\,\mathbf{w}^T) + \mathbf{u}_{rot}(w_0 I) = \mathbf{u}_{rot}(A_c) + (\mathbf{w} \cdot \mathbf{r})\,\mathbf{u}_{tr}(\hat{\mathbf{t}})$$

i.e., the rotational flow due to $A_t$ is equal to the translational flow of a certain
scene plane.
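
The linearity identity above can be checked numerically. The sketch assumes the normalisation $\mathbf{u}_{tr}(\mathbf{t}) = -\mathbf{z} \times (\mathbf{t} \times \mathbf{r})$ and $\mathbf{u}_{rot}(A) = \frac{1}{f}\,\mathbf{z} \times (\mathbf{r} \times A\mathbf{r})$, with any constant factor of the translational term absorbed into the depth scale. The helper names are ours and this normalisation is an assumption rather than a definition taken from the paper.

```python
Z_AXIS = (0.0, 0.0, 1.0)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(c, a):
    return tuple(c * x for x in a)


def matvec(A, v):
    return tuple(dot(row, v) for row in A)


def u_tr(t, r):
    """Translational flow direction, assumed normalisation: -z x (t x r)."""
    return scale(-1.0, cross(Z_AXIS, cross(t, r)))


def u_rot(A, r, f):
    """Rotational flow of matrix A: (1/f) z x (r x (A r))."""
    return scale(1.0 / f, cross(Z_AXIS, cross(r, matvec(A, r))))


def a_t(t, w, w0, f):
    """The component A_t = f t w^T + w0 I of the decomposition (4)."""
    return [[f * t[i] * w[j] + (w0 if i == j else 0.0) for j in range(3)]
            for i in range(3)]
```

Under these assumptions, `u_rot(a_t(t, w, w0, f), r, f)` agrees with `scale(dot(w, r), u_tr(t, r))` for any image point `r = (x, y, f)`, and the term `w0 I` contributes nothing because `r x r = 0`.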

In the sequel, $\hat{\mathbf{t}}$ is used to denote an estimate of the apparent translation $K\mathbf{t}$
and $\hat{A}_c$ denotes an estimate of the copoint matrix $A_c$.



2.1 Depth Estimation from Motion Fields 



This section introduces the novel criterion of “smoothness of depth” which is 
used to evaluate the consistency of the normal flow field with the estimated 3D 
motion and also to segment the scene at its depth boundaries. The idea is based 
on the interrelationship of estimated 3D motion $(\hat{\mathbf{t}}, \hat{A}_c)$ and estimated depth of
the scene $\hat{Z}$.

The structure of the scene, i.e., computed depth, can be expressed as a function
of the motion parameters. Suppose we have estimated the apparent translation
$\hat{\mathbf{t}}$ and the copoint matrix $\hat{A}_c$ (for that particular $\hat{\mathbf{t}}$).

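At an image point $\mathbf{r}$ where the normal flow direction is $\mathbf{n}$, we can compute
inverse depth (up to a projective ambiguity) as

$$\frac{1}{\hat{Z}} = \frac{\dot{\mathbf{r}} \cdot \mathbf{n} - \mathbf{u}_{rot}(\hat{A}_c) \cdot \mathbf{n}}{\mathbf{u}_{tr}(\hat{\mathbf{t}}) \cdot \mathbf{n}} \qquad (5)$$

Substituting from (3) and (4) into (5), we obtain:

$$\frac{1}{\hat{Z}} = \frac{\frac{1}{Z}\,\mathbf{u}_{tr}(\mathbf{t}) \cdot \mathbf{n} - \mathbf{u}_{rot}(\delta A_c) \cdot \mathbf{n}}{\mathbf{u}_{tr}(\hat{\mathbf{t}}) \cdot \mathbf{n}} + \mathbf{w} \cdot \mathbf{r} \qquad (6)$$

where $\mathbf{u}_{rot}(\delta A_c)$ is the rotational flow due to the rotational error $\delta A_c = \hat{A}_c - A_c$.
Notice that when $\hat{\mathbf{t}}$ and $\hat{A}_c$ are correct estimates of $\mathbf{t}$ and $A_c$ respectively,
we have

$$\frac{1}{\hat{Z}} = \frac{1}{Z} + \mathbf{w} \cdot \mathbf{r}$$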

Obviously, recovering $\mathbf{w}$ is not enough to obtain a Euclidean scene reconstruction;
we still need the calibration matrix $K$.

For incorrect motion estimates, we obtain distorted estimates of the scene
depth and the amount of distortion depends on the normal flow direction $\mathbf{n}$.
The larger the angle between vectors $\mathbf{u}_{tr}(\hat{\mathbf{t}})$ and $\frac{1}{Z}\mathbf{u}_{tr}(\mathbf{t}) - \mathbf{u}_{rot}(\delta A_c)$, the more
the distortion will be spread out over the different directions. Thus, considering
a patch of a smooth surface in space and assuming that normal flow measurements
are taken along many directions, a rugged (i.e., unsmooth) surface will be
computed on the basis of wrong 3D motion estimates.

The above observation constitutes the main idea behind our algorithm. For 
a candidate 3D motion estimate we evaluate the smoothness of estimated depth 
within image patches, compensating for the unknown linear function w • r. If 
the chosen image patches correspond to smooth 3D scene patches the correct 
3D motion will certainly give rise to the overall smoothest image patches. To 
obtain such a situation we attempt a segmentation of the scene on the basis of 
estimated depth while at the same time testing the candidate 3D motions. As
all computations are based on normal flow and we thus have available the full 
statistics of the raw data, a good segmentation generally is possible. 

3 Estimation of Viewing Geometry and Segmentation 

To measure estimated depth smoothness, it is certainly possible to utilize the 
inverse depth directly and evaluate its variation over small image regions. However,
we may run into instability problems, as division by small $\mathbf{u}_{tr} \cdot \mathbf{n}$ can greatly
amplify measurement errors.

To address such potential problems, we improve the method of [5] and formulate
a criterion in terms of image measurements. Instead of computing variation
of estimated depth, we assume that the depth is smooth, i.e., in a small image 
region, we choose the smooth depth that best corresponds to the image mea- 
surements. The goodness of the fit is the error measure for that image region. 

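Consider a small image region $\mathcal{R}$. We define a measure that is small if there
exists a smooth estimated inverse depth that, combined with the estimated motion
parameters, matches the image measurements well.

It is not sufficient to assume a constant inverse depth because of the unknown
parameters. We approximate the inverse depth $1/\hat{Z}$ in the region by a linear
function, $1/\hat{Z} = \mathbf{z} \cdot \mathbf{r}$, where $\mathbf{z}$ now denotes a parameter vector of the linear
depth model rather than the unit vector along the optical axis (note that the third
component of $\mathbf{r}$ is a constant $f$, so $\mathbf{z} \cdot \mathbf{r}$ is
a general linear function in the image coordinates). As the unknown component
of inverse depth is a linear function $\mathbf{w} \cdot \mathbf{r}$, it can be incorporated into $\mathbf{z} \cdot \mathbf{r}$ and
the smoothness criterion thus does not depend on the unknown parameters.

In the region $\mathcal{R}$ that contains a set of measurements $\dot{\mathbf{r}}_i$ with directions $\mathbf{n}_i$ we
can define

$$\Theta_0(\hat{\mathbf{t}}, \hat{A}_c, \mathbf{z}, \mathcal{R}) = \sum_i \Big(\dot{\mathbf{r}}_i \cdot \mathbf{n}_i - (\mathbf{z} \cdot \mathbf{r}_i)\,(\mathbf{u}_{tr}(\hat{\mathbf{t}}) \cdot \mathbf{n}_i) - \mathbf{u}_{rot}(\hat{A}_c) \cdot \mathbf{n}_i\Big)^2 \qquad (7)$$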

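Minimization of $\Theta_0$ with respect to $\mathbf{z}$ provides the best smooth depth. We
have

$$\frac{1}{2}\frac{\partial \Theta_0}{\partial \mathbf{z}} = \sum_i (\mathbf{z} \cdot \mathbf{r}_i)\,(\mathbf{u}_{tr}(\hat{\mathbf{t}}) \cdot \mathbf{n}_i)^2\,\mathbf{r}_i - \sum_i \big(\dot{\mathbf{r}}_i \cdot \mathbf{n}_i - \mathbf{u}_{rot}(\hat{A}_c) \cdot \mathbf{n}_i\big)\,(\mathbf{u}_{tr}(\hat{\mathbf{t}}) \cdot \mathbf{n}_i)\,\mathbf{r}_i = 0 \qquad (8)$$

a set of three linear equations for the three elements of $\mathbf{z}$. Substituting the
solution of (8) into (7), we obtain $\Theta_1(\hat{\mathbf{t}}, \hat{A}_c, \mathcal{R})$, a second order function of $\hat{A}_c$.
Notice that the computation can be performed even when $\hat{A}_c$ is not known.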
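
For a fixed candidate motion, the local fit of Eqs. (7) and (8) reduces to a 3x3 linear system. The sketch below solves it numerically for one region. It assumes that the scalars `u[i]` (standing for $\mathbf{u}_{tr}(\hat{\mathbf{t}}) \cdot \mathbf{n}_i$) and `c[i]` (standing for $\dot{\mathbf{r}}_i \cdot \mathbf{n}_i - \mathbf{u}_{rot}(\hat{A}_c) \cdot \mathbf{n}_i$) have already been computed; the paper instead keeps $\hat{A}_c$ symbolic at this stage. All names are ours.

```python
def solve3(M, b):
    """Solve the 3x3 system M x = b by Gaussian elimination with pivoting."""
    A = [list(M[i]) + [b[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(col + 1, 3):
            factor = A[row][col] / A[col][col]
            for k in range(col, 4):
                A[row][k] -= factor * A[col][k]
    x = [0.0] * 3
    for i in (2, 1, 0):
        x[i] = (A[i][3] - sum(A[i][j] * x[j] for j in range(i + 1, 3))) / A[i][i]
    return x


def fit_patch(points, u, c):
    """Return (z, theta1): the linear inverse-depth parameters solving Eq. (8)
    and the residual of Eq. (7) at that solution.

    points[i] is r_i = (x, y, f); u[i] and c[i] are the precomputed scalars
    described in the lead-in.
    """
    M = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for r, ui, ci in zip(points, u, c):
        for a in range(3):
            b[a] += ci * ui * r[a]
            for k in range(3):
                M[a][k] += ui * ui * r[a] * r[k]
    z = solve3(M, b)
    theta1 = sum((ci - sum(za * ra for za, ra in zip(z, r)) * ui) ** 2
                 for r, ui, ci in zip(points, u, c))
    return z, theta1
```

If the measurements are exactly consistent with a linear inverse depth, the residual vanishes; a large residual signals either a wrong candidate motion or a depth discontinuity inside the region.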

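To estimate $\hat{A}_c$, we sum up all the local functions and obtain a global function:

$$\Theta_2(\hat{\mathbf{t}}, \hat{A}_c) = \sum_{\mathcal{R}} \Theta_1(\hat{\mathbf{t}}, \hat{A}_c, \mathcal{R}) \qquad (9)$$

Finally, global minimization yields the best copoint matrix $\hat{A}_c$ and also a
measure of depth smoothness for the apparent translation $\hat{\mathbf{t}}$:

$$\Phi(\hat{\mathbf{t}}) = \min_{\hat{A}_c} \Theta_2(\hat{\mathbf{t}}, \hat{A}_c) \qquad (10)$$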

3.1 Algorithm Description

The epipole is found by localizing the minimum of the function $\Phi(\hat{\mathbf{t}})$ described above.
We first summarize the computation of $\Phi(\hat{\mathbf{t}})$ and then explain the algorithm. To
obtain $\Phi(\hat{\mathbf{t}})$:

1. Partition the image into small regions; in each region compute $\Theta_0(\hat{\mathbf{t}}, \hat{A}_c, \mathbf{z}, \mathcal{R})$
and perform local minimization with respect to $\mathbf{z}$ (the computation is symbolic in the
unknown elements of $\hat{A}_c$). The minimum is denoted by $\Theta_1(\hat{\mathbf{t}}, \hat{A}_c, \mathcal{R})$.

2. Add all the local functions $\Theta_1(\hat{\mathbf{t}}, \hat{A}_c, \mathcal{R})$ and minimize the resulting $\Theta_2(\hat{\mathbf{t}}, \hat{A}_c)$
to obtain $\hat{A}_c$.

3. $\Phi(\hat{\mathbf{t}})$ is the minimum of $\Theta_2(\hat{\mathbf{t}}, \hat{A}_c)$.

Smoothness must only be enforced in image patches corresponding to smooth
parts of the scene. Thus, once we have obtained preliminary estimates of $\Phi(\hat{\mathbf{t}})$
and $\hat{A}_c$, we compute the inverse depth and try to perform depth segmentation.
A detailed description of the segmentation algorithm is given in the next section.

After the segmentation, we recompute $\Phi(\hat{\mathbf{t}})$ by enforcing smoothness only
within image regions that do not contain a depth discontinuity. Notice that it is
not necessary to recompute $\Theta_1$ for all the image regions, as we can locally compute
the change of $\Theta_1$ for the regions that are segmented. The improved $\Theta_2(\hat{\mathbf{t}}, \hat{A}_c)$
then provides a new $\hat{A}_c$ and a new value of $\Phi(\hat{\mathbf{t}})$.

The edge detection and re-computation process could be iteratively repeated, 
but in practice a single iteration seems to be sufficient, as we explain in the next 
section. 

One may ask whether depth segmentation using an incorrect copoint matrix 
(due to incorrect information from regions with depth discontinuities) can actu- 
ally improve the estimation of the viewing geometry. The answer is yes, almost 
always. For most scenes, a majority of the image regions correspond to smooth 
scene patches and should yield the correct copoint matrix. Regions with depth 
discontinuities often provide the correct solution or at least relatively small errors 
for the correct motion. Such regions may give better results for some incorrect 
motion, but usually different motions would be the best for different patches. 

Computation of Ac uses information from the whole image and when we 
ignore depth discontinuities in the first stage of the algorithm, we may bias the 
solution, but in most cases not by much. If the estimated copoint matrix is 
relatively close to the true Ac, the major depth discontinuities (and those are 
the ones we are most interested in) should certainly appear in the segmentation. 

To find the minimum of Φ and thus the apparent translation, we perform 
a hierarchical search over the two-dimensional space of epipole positions. In 
practice, the function is quite smooth, that is, small changes in t give rise to 
only small changes in Φ. One of the reasons for this is that for any t, the value 
of Φ(t) is influenced by all the normal flow measurements and not only by a 
small subset. 
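
A hierarchical search of this kind can be sketched as a coarse-to-fine grid search. The sketch below is only an illustration under stated assumptions: the function names, the grid size, the number of levels and the quadratic `toy_cost` (a stand-in for Φ(t), which is not reproduced here) are all hypothetical and are not taken from the paper.

```python
def hierarchical_search(cost, center, half_width, levels=4, grid=5):
    """Coarse-to-fine minimisation of cost(x, y) over a square window."""
    best = center
    for _ in range(levels):
        step = 2.0 * half_width / (grid - 1)
        candidates = [
            (best[0] - half_width + i * step, best[1] - half_width + j * step)
            for i in range(grid)
            for j in range(grid)
        ]
        # Keep the grid position with the smallest cost.
        best = min(candidates, key=lambda p: cost(p[0], p[1]))
        # Shrink the window to one grid cell around the current best position.
        half_width = step
    return best


def toy_cost(x, y):
    # Hypothetical smooth stand-in for Phi(t); its minimum is at (30, -100).
    return (x - 30.0) ** 2 + 2.0 * (y + 100.0) ** 2


epipole = hierarchical_search(toy_cost, center=(0.0, 0.0), half_width=400.0)
```

Because the function is smooth, a coarse grid is enough to locate the basin of the minimum, and each level only refines the position inside the best cell. The `center` argument is where a previous estimate could be supplied.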

For most sequences, the motion of the camera does not change abruptly and 
we can use the computed epipole as a starting point of the search in the next 
frame. 

3.2 Patch Segmentation 

Given an inverse depth map computed using (t, Ac), we apply a Canny edge 
detector with special precautions taken to handle sparse input data. However, 
we also need to take into account the unknown linear term w · r. The term 
cannot be estimated, but to improve the edge detection process, we add a linear 
function to the estimated inverse depth so that the average depth in different 
parts of the image is approximately the same. 

By segmenting a patch, we decrease its contribution to Φ(t). Ideally, an image 
patch should be segmented if it contains two smooth scene surfaces separated by 
a depth discontinuity and the depth is estimated using the correct 3D motion. It 
would be best not to segment any patches when the 3D motion is incorrect 
and no depth discontinuities are present, but this of course cannot be guaranteed. 

We use a simple segmentation criterion. If the edge detection algorithm finds 
an edge that divides an image region into two coherent subregions, we split that 
region. On the other hand, we do not split regions containing a more complicated 
edge structure (which may be expected for distorted depth estimates, since the 
distortion depends on the normal flow direction). 

It is simple to show that this strategy can be expected to yield good patch 
segmentation. When the motion estimate is correct, the depth estimates 
are also correct, and patches with a large amount of depth variation contain depth 
discontinuities. 

For incorrect motions, the distortion factor depends on the direction of nor- 
mal flow. While for any patch we can split the depth estimates into two groups 
and decrease the error measure, it is highly unlikely that the two groups of 
measurements define two spatially coherent separate subregions. Consequently, 
several edges can be expected in such a patch and it will not be split by the 
algorithm. 
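
The splitting rule can be illustrated with a small sketch. It assumes that the edge detector output for a patch is available as a boolean mask, treats "coherent subregion" as a 4-connected component of non-edge pixels, and adds a hypothetical `min_fraction` size threshold; none of these details are specified in the text, so this is one possible reading rather than the authors' implementation.

```python
from collections import deque


def coherent_subregions(edge_mask):
    """Return the 4-connected components of the non-edge pixels of a patch."""
    rows, cols = len(edge_mask), len(edge_mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r0 in range(rows):
        for c0 in range(cols):
            if edge_mask[r0][c0] or seen[r0][c0]:
                continue
            queue, pixels = deque([(r0, c0)]), []
            seen[r0][c0] = True
            while queue:  # breadth-first flood fill
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and not edge_mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components


def should_split(edge_mask, min_fraction=0.2):
    """Split only if the edges cut the patch into exactly two sizeable parts."""
    total = len(edge_mask) * len(edge_mask[0])
    parts = coherent_subregions(edge_mask)
    return len(parts) == 2 and all(len(p) >= min_fraction * total for p in parts)


# One clean vertical edge: two coherent halves, so the patch is split.
single_edge = [[c == 2 for c in range(5)] for _ in range(5)]
# Crossing edges: a more complicated structure, so the patch is left alone.
crossing = [[(c == 2 or r == 2) for c in range(5)] for r in range(5)]
```

With these toy masks, `should_split(single_edge)` is true and `should_split(crossing)` is false, which mirrors the behaviour described above for correct and incorrect motions.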

However, as the segmentation is based on local information only, we cannot 
expect it to perfectly distinguish the depth discontinuities from depth variation 
due to incorrect 3D motion. Sometimes, if a patch contains two subregions with 
different distributions of normal flow directions, we may split the patch even for 
incorrect 3D motion. An improvement, however, could be achieved by taking 
into account also the distribution of normal flow directions in relation to the 
estimated depth. 

The local results are used in a global measure and occasional segmentation 
errors are unlikely to change the overall results. For the segmentation to cause 
an incorrect motion to yield the smallest Φ(t), special normal flow configurations 
would have to occur in many patches in the image. 



3.3 Algorithm Analysis 

A single image region R contributes the value Θ1(t, Ac, R) to the global 
criterion. It is shown below that the function can be decomposed into two parts: the 
first part is a multiple of the epipolar error and thus measures the error of the 
motion estimate. The second part is equal to the residual of the least squares 
fitting of optical flow to the measurements in the region. 

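The vectors u_tr(t) and u_rot(Ac) are polynomial functions of image position r 
and can usually be approximated by constants within a small image region. To 
simplify the analysis we use a rotated local coordinate system so that 

$$\mathbf{u}_{tr}(\mathbf{t}) = [1, 0, 0]^T, \quad \mathbf{u}_{rot}(A_c) = [u_{rx}, u_{ry}, 0]^T, \quad u_{n_i} = \dot{\mathbf{r}}_i \cdot \mathbf{n}_i, \quad \mathbf{n}_i = [\cos\psi_i, \sin\psi_i, 0]^T \tag{11}$$

We can then rewrite Θ0 as 

$$\Theta_0 = \sum_i W_i \left( u_{n_i} - (\mathbf{z} \cdot \mathbf{r}_i)\cos\psi_i - u_{rx}\cos\psi_i - u_{ry}\sin\psi_i \right)^2 \tag{12}$$

Note that u_rx can be incorporated into z (writing z' = z + [0, 0, u_rx/f]^T) and we 
thus obtain the same minimum for the simplified expression: 

$$\Theta_0 = \sum_i W_i \left( u_{n_i} - (\mathbf{z} \cdot \mathbf{r}_i)\cos\psi_i - u_{ry}\sin\psi_i \right)^2 \tag{13}$$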

Now consider the least squares estimation of optical flow in the region using 
weights W_i. Allowing linear depth changes, in the local coordinate system we fit 
flow (u_x · r, u_y), i.e., a linear function along the direction of u_tr(t) and a constant 
in the perpendicular direction. We would minimize 

$$\sum_i W_i \left( u_{n_i} - (\mathbf{u}_x \cdot \mathbf{r}_i)\cos\psi_i - u_y\sin\psi_i \right)^2 \tag{14}$$

Expressions (13) and (14) are almost identical, but there is one important 
difference. The optical flow minimization (14) is strictly local, using only 
measurements from the region. On the other hand, in (13), the rotational flow (u_rx, u_ry) 
is determined by the global motion parameters. 

Let us denote the least squares solution of (14) as (û_x, û_y) and the residual 
as E_p. After some vector and matrix manipulation we can obtain 

$$\Theta_1 = \left( m_{ss} - \mathbf{m}_{cs}^T M_{cc}^{-1} \mathbf{m}_{cs} \right) \delta u_{ry}^2 + E_p = K\, \delta u_{ry}^2 + E_p \tag{15}$$

where 

$$m_{ss} = \sum_i W_i \sin^2\psi_i, \quad \mathbf{m}_{cs} = \sum_i W_i \cos\psi_i \sin\psi_i\, \mathbf{r}_i, \quad M_{cc} = \sum_i W_i \cos^2\psi_i\, \mathbf{r}_i \mathbf{r}_i^T$$

and δu_ry = u_ry − û_y is the difference of the globally determined rotational 
component u_ry and the best local optical flow component û_y. Both of the components 
are in the direction perpendicular to the translational flow and δu_ry is therefore 
the epipolar distance, as shown in detail in the following section. 
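
The "vector and matrix manipulation" can be sketched as follows. This is a reconstruction under the assumption that Θ1 is the minimum over z of the simplified criterion with u_ry held fixed, that m_ss, m_cs and M_cc are the weighted sums of sin²ψ_i, cos ψ_i sin ψ_i r_i and cos²ψ_i r_i r_iᵀ, and that ε_i denotes the residual of the local flow fit at its minimum; it is not quoted from the paper.

```latex
% Shorthand: c_i = \cos\psi_i, \; s_i = \sin\psi_i.
\begin{align*}
u_{n_i} &= (\hat{\mathbf{u}}_x\cdot\mathbf{r}_i)\,c_i + \hat{u}_y\,s_i + \epsilon_i,
\qquad \sum_i W_i\,\epsilon_i\,c_i\,\mathbf{r}_i = \mathbf{0},
\qquad \sum_i W_i\,\epsilon_i\,s_i = 0, \\
\intertext{(the normal equations of the local fit). Substituting
$\mathbf{z} = \hat{\mathbf{u}}_x + \Delta$ and $u_{ry} = \hat{u}_y + \delta u_{ry}$:}
\sum_i W_i\bigl(u_{n_i} - (\mathbf{z}\cdot\mathbf{r}_i)\,c_i - u_{ry}\,s_i\bigr)^2
&= E_p + \Delta^T M_{cc}\,\Delta
   + 2\,\delta u_{ry}\,\Delta^T\mathbf{m}_{cs}
   + \delta u_{ry}^2\,m_{ss}, \\
\intertext{where the cross terms with $\epsilon_i$ vanish. Minimizing over $\Delta$ gives
$\Delta = -\delta u_{ry}\,M_{cc}^{-1}\mathbf{m}_{cs}$, hence}
\Theta_1 &= \bigl(m_{ss} - \mathbf{m}_{cs}^T M_{cc}^{-1}\mathbf{m}_{cs}\bigr)\,\delta u_{ry}^2 + E_p .
\end{align*}
```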

Equation (15) provides an interpretation of the criterion used in this paper. 
The component E_p evaluates whether the depth in the region is smooth. The 
first term, K δu_ry², is a product of the squared epipolar distance and a factor K 
that only depends on the geometric configuration of measurements within the 
region. The complete analysis is complicated by the fact that K also depends on 
the point positions. One simple observation is the boundedness of K: matrix M_cc 
is positive definite, so 0 ≤ K ≤ m_ss ≤ Σ_i W_i. 

Since the image regions are small, some intuition can be obtained by assuming 
that all the measurements are taken at a single point. Then both matrix M_cc 
and vector m_cs simplify into scalars and K becomes 

$$K = \frac{\sum_i W_i \sin^2\psi_i \, \sum_i W_i \cos^2\psi_i - \left( \sum_i W_i \sin\psi_i \cos\psi_i \right)^2}{\sum_i W_i \cos^2\psi_i} \tag{16}$$

Applying the Cauchy-Schwarz inequality to the numerator, we see that 
expression (16) can only be zero if the sequences √W_1 sin ψ_1, …, √W_n sin ψ_n and 
√W_1 cos ψ_1, …, √W_n cos ψ_n are proportional. If we use weights W_i that only 
depend on the angle ψ_i, the sequences are proportional if all the normal flow 
directions are the same. Thus a small range of normal flow directions yields a 
small K. There is only one special case: K reaches the maximum Σ_i W_i when 
all the normal flow measurements are perpendicular to the estimated 
translational flow, so that sin ψ_i = 1. This is a desirable property. The epipolar error 
depends on the projection of the flow onto the direction perpendicular to the 
translational flow. If all the measurements are in that direction, they provide a 
maximum amount of information about the needed flow projection. 
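
The single-point case is simple enough to check numerically. The sketch below is an illustration only: it assumes K = m_ss − m_cs²/m_cc with scalar weighted sums of sin², sin·cos and cos² of the normal flow angles, and it uses arbitrary made-up weights, angles and normal flow values. It checks that the constrained minimum equals K δu_ry² plus the residual of the local fit, and evaluates K for identical and for perpendicular directions.

```python
import math


def moments(weights, angles):
    """Return (m_ss, m_cs, m_cc) for measurements taken at a single point."""
    m_ss = sum(w * math.sin(a) ** 2 for w, a in zip(weights, angles))
    m_cs = sum(w * math.sin(a) * math.cos(a) for w, a in zip(weights, angles))
    m_cc = sum(w * math.cos(a) ** 2 for w, a in zip(weights, angles))
    return m_ss, m_cs, m_cc


def k_factor(weights, angles):
    """K = m_ss - m_cs^2 / m_cc (scalar, single-point case)."""
    m_ss, m_cs, m_cc = moments(weights, angles)
    if m_cc < 1e-12:  # all directions perpendicular to the translational flow
        return m_ss
    return m_ss - m_cs ** 2 / m_cc


def local_fit(weights, angles, u_n):
    """Weighted fit u_n ~ a*cos(psi) + b*sin(psi); returns (a, b, residual)."""
    m_ss, m_cs, m_cc = moments(weights, angles)
    rhs_c = sum(w * u * math.cos(p) for w, p, u in zip(weights, angles, u_n))
    rhs_s = sum(w * u * math.sin(p) for w, p, u in zip(weights, angles, u_n))
    det = m_cc * m_ss - m_cs ** 2
    a = (rhs_c * m_ss - rhs_s * m_cs) / det
    b = (rhs_s * m_cc - rhs_c * m_cs) / det
    res = sum(w * (u - a * math.cos(p) - b * math.sin(p)) ** 2
              for w, p, u in zip(weights, angles, u_n))
    return a, b, res


def theta_1(weights, angles, u_n, u_ry):
    """Minimum over a of sum W (u_n - a cos(psi) - u_ry sin(psi))^2."""
    _, _, m_cc = moments(weights, angles)
    a = sum(w * math.cos(p) * (u - u_ry * math.sin(p))
            for w, p, u in zip(weights, angles, u_n)) / m_cc
    return sum(w * (u - a * math.cos(p) - u_ry * math.sin(p)) ** 2
               for w, p, u in zip(weights, angles, u_n))


weights = [1.0, 2.0, 1.5, 0.5]
angles = [0.2, 0.9, 1.6, 2.3]
u_n = [0.8, 1.1, -0.3, 0.4]        # arbitrary illustrative normal flow values
_, u_y_hat, e_p = local_fit(weights, angles, u_n)
delta = 0.25                        # globally imposed offset from the local fit
lhs = theta_1(weights, angles, u_n, u_y_hat + delta)
rhs = k_factor(weights, angles) * delta ** 2 + e_p

k_same_direction = k_factor([1.0] * 4, [0.7] * 4)
k_perpendicular = k_factor([1.0] * 3, [math.pi / 2] * 3)
```

Here `lhs` and `rhs` agree up to rounding, `k_same_direction` is numerically zero, and `k_perpendicular` equals the sum of the weights.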

In general, K measures the range of normal flow directions within the 
region while preferring measurements that provide more information about the 
epipolar error. Compared to the epipolar constraint, the depth smoothness 
measure for smooth patches emphasizes regions with larger variation of normal flow 
directions and can thus be expected to yield better results for noisy data. 

Finally, let us examine the behavior of Θ1 for a smooth scene patch. Ignoring 
noise, we obtain E_p = 0 and the estimated (û_x, û_y) is equal to the true motion 
field vector. For any translation, we can make Θ1 zero by choosing a rotation 
that yields δu_ry = 0. But the rotation is not determined locally! 

For the correct translation, the global rotation estimate should be correct and 
we should obtain zero Θ1 for all the smooth patches. Now consider an incorrect 
translation candidate. It is easy to find a rotation to make Θ1 zero for one or 
several smooth patches. But if we are able to find a rotation that yields zero Θ1 
for many different patches, we can obtain exactly the same motion field for two 
different 3D motions and the scene (or large parts of it) has to be close to an 
ambiguous surface. 

Thus except for ambiguous surfaces we should obtain the correct 3D motion 
if most of the regions used correspond to smooth patches. 

3.4 The Epipolar Constraint 

The depth smoothness measure is closely related to the traditional epipolar 
constraint and we examine the relationship here. In the instantaneous form the 
epipolar constraint can be written as 

$$(\hat{\mathbf{z}} \times \mathbf{u}_{tr}(\mathbf{t})) \cdot (\dot{\mathbf{r}} - \mathbf{u}_{rot}(A_c)) = 0 \tag{17}$$

Note that we can use Ac instead of A, since the flow u_rot(A − Ac) is parallel to 
the translational flow u_tr(t) and therefore (ẑ × u_tr(t)) · u_rot(A − Ac) = 0. 

Usually, the distance of the flow vector ṙ from the epipolar line (determined 
by u_tr(t) and u_rot(Ac)) is computed, and the sum of the squared distances, i.e., 

$$\sum_i \left( (\hat{\mathbf{z}} \times \mathbf{u}_{tr}(\mathbf{t})) \cdot (\dot{\mathbf{r}}_i - \mathbf{u}_{rot}(A_c)) \right)^2 \tag{18}$$

is minimized. 

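Methods based upon (18) suffer from bias, however, and a scaled epipolar 
constraint has been used to give an unbiased solution: 

$$\sum_i \left( \frac{(\hat{\mathbf{z}} \times \mathbf{u}_{tr}(\mathbf{t})) \cdot (\dot{\mathbf{r}}_i - \mathbf{u}_{rot}(A_c))}{\|\mathbf{u}_{tr}(\mathbf{t})\|} \right)^2 \tag{19}$$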



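Again we use the coordinate system and notation of (11). Suppose that the 
flow vector ṙ has been obtained by minimization of (14) and write it as (û_x, û_y). 
Substituting into (19) we obtain 

$$\frac{(\hat{\mathbf{z}} \times \mathbf{u}_{tr}(\mathbf{t})) \cdot (\dot{\mathbf{r}} - \mathbf{u}_{rot}(A_c))}{\|\mathbf{u}_{tr}(\mathbf{t})\|} = \hat{u}_y - u_{ry} = -\delta u_{ry} \tag{20}$$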



Equations (20) and (15) illustrate the relationship of the epipolar constraint 
(albeit possibly using non-standard weights to estimate flow) and the smoothness 
measure Θ1. 
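
The scaled epipolar residual is easy to evaluate for image-plane vectors. The following sketch is illustrative only: the function name is hypothetical, the translational and rotational flow vectors are passed in as plain 2D tuples rather than computed from t and Ac, and ẑ is assumed to be the unit vector along the optical axis.

```python
import math


def scaled_epipolar_residual(u_tr, flow, u_rot):
    """(z_hat x u_tr) . (flow - u_rot) / |u_tr| for 2D image-plane vectors."""
    perp = (-u_tr[1], u_tr[0])  # z_hat x (a, b, 0) = (-b, a, 0)
    diff = (flow[0] - u_rot[0], flow[1] - u_rot[1])
    return (perp[0] * diff[0] + perp[1] * diff[1]) / math.hypot(u_tr[0], u_tr[1])


# In the rotated local frame, u_tr = (1, 0) and the residual reduces to the
# difference of the components perpendicular to the translational flow.
local = scaled_epipolar_residual((1.0, 0.0), (0.4, 0.7), (0.1, 0.2))

# Rescaling u_tr does not change the residual (this is the point of the scaling).
rescaled = scaled_epipolar_residual((3.0, 0.0), (0.4, 0.7), (0.1, 0.2))

# A flow of the form u_rot + lambda * u_tr satisfies the constraint exactly.
on_line = scaled_epipolar_residual((2.0, 1.0), (4.1, 2.2), (0.1, 0.2))
```

In the local frame the residual is 0.7 − 0.2 = 0.5, i.e. the difference between the flow component and the rotational component perpendicular to the translational flow.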



4 Self-Calibration 

So far we have shown how to estimate the apparent translation t and the copoint 
matrix Ac. To perform camera self-calibration, and thus subsequently derive 
structure, we can use the method developed in [2], combining the partial 
information under the assumption that the internal camera parameters are constant 
throughout the image sequence. 

As noted earlier, the rotational component of the motion field is determined 
by the matrix A = K[ω]×K⁻¹. Matrix [ω]× is skew-symmetric, i.e., [ω]× + [ω]×ᵀ = 0. 
This is the constraint we use, expressed in terms of K and A as: 

$$K^{-1} A K + (K^{-1} A K)^T = 0 \tag{21}$$

Suppose we have a set of copoint matrices A_ci and, using trace A = 0, we solve 
for w_0. We may write 

$$A_i = A_{ci} + \mathbf{t}_i \mathbf{w}_i^T - \tfrac{1}{3}\,\mathrm{trace}\left(A_{ci} + \mathbf{t}_i \mathbf{w}_i^T\right) I$$

The error measure is based on (21): 

$$\mathcal{E}(K) = \sum_i \left\| K^{-1} A_i K + (K^{-1} A_i K)^T \right\|^2 \tag{22}$$

Function E(K) can be minimized with respect to all the unknown vectors w_i 
(it is a second order function of the unknown vectors), yielding an error measure 
in terms of the calibration parameters alone. Levenberg-Marquardt minimization 
is then used to obtain the calibration parameters. For details, the reader is 
referred to [2]. 
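
The skew-symmetry constraint behind the error measure can be illustrated on synthetic data. The sketch below makes strong simplifying assumptions that are not in the paper: the matrices A_i are taken as fully known (built as K[ω]×K⁻¹ from made-up rotations, with no unknown vectors to eliminate), the norm is the Frobenius norm, and the calibration matrix uses a hypothetical parametrisation (fx, fy, cx, cy, s). No minimization is performed; the code only evaluates the error for a correct and an incorrect K.

```python
def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def calibration_matrix(fx, fy, cx, cy, s):
    return [[fx, s, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def calibration_inverse(fx, fy, cx, cy, s):
    """Closed-form inverse of the upper-triangular calibration matrix."""
    return [[1.0 / fx, -s / (fx * fy), (s * cy - cx * fy) / (fx * fy)],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def skew(w):
    """Skew-symmetric matrix [w]x of a 3-vector."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def calibration_error(params, a_matrices):
    """Sum over i of || K^-1 A_i K + (K^-1 A_i K)^T ||^2 (Frobenius norm)."""
    k = calibration_matrix(*params)
    k_inv = calibration_inverse(*params)
    total = 0.0
    for a in a_matrices:
        m = matmul(matmul(k_inv, a), k)
        total += sum((m[i][j] + m[j][i]) ** 2 for i in range(3) for j in range(3))
    return total


# Synthetic data: made-up intrinsics and rotational velocities.
true_params = (450.0, 460.0, 10.0, -5.0, 0.0)
rotations = [(0.01, -0.02, 0.03), (-0.015, 0.01, 0.02)]
k_true = calibration_matrix(*true_params)
k_true_inv = calibration_inverse(*true_params)
a_matrices = [matmul(matmul(k_true, skew(w)), k_true_inv) for w in rotations]

error_at_truth = calibration_error(true_params, a_matrices)
error_wrong_focal = calibration_error((500.0, 460.0, 10.0, -5.0, 0.0), a_matrices)
```

The error vanishes (up to rounding) for the parameters that generated the data and is strictly positive for the perturbed focal length, which is what makes it usable as a calibration criterion.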

5 Experimental Results 

We present various experiments testing the different aspects of the introduced 
method, namely the competence of the technique to extract depth edges and a 
comparison between the smoothness measure and the epipolar constraint. The 
optical flow used in the epipolar error minimization was computed by the method 
of Lucas and Kanade, as implemented in [1]. 

One sequence used was the Yosemite fly-through sequence (one frame is 
shown in Figure 1(a)). The images also show independently moving clouds and 
each image frame was thus clipped to contain only the mountain range. One 
frame from an indoor lab sequence is shown in Figure 1(b). 




Fig. 1. (a) One frame of the Yosemite sequence. Only the bottom part of the 
image was used. (b) One frame of the lab sequence. 



Experiment 1. We show examples of the depth estimation and segmentation 
process for the lab sequence, specifically for the frame in Figure 1(b). In Figure 2 
the estimated inverse depth is displayed for the correct translation (as estimated 
by the algorithm) and for one incorrect translation. The corresponding depth 
segmentation results are shown in Figure 3. 

Experiment 2. For the Yosemite sequence, both our method and the epipolar 
minimization perform quite well. The known epipole location in the image plane 
was (0,-100). The estimated epipole locations for different methods are sum- 
marized in Table 1. 









Fig. 2. (a) Inverse depth estimated for the correct epipole ((397, -115) pixels 
from the image center). (b) Inverse depth for an incorrect FOE ((-80, 115) pixels 
from the image center). The grey-level value represents the estimated inverse depth, 
with mid-level grey shown in places where no information was available, white 
representing positive 1/Z and black representing negative 1/Z. 



method                       epipole
ground truth                 (0.0, -100.0)
epipolar minimization        (0.5, -98.8)
Φ(t) (no segmentation)       (2.4, -96.7)
Φ(t) (incl. segmentation)    (0.0, -103.6)



Table 1. Estimated epipole locations for the Yosemite sequence 
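
To make the comparison in Table 1 easier to read, the distance of each estimate from the ground truth can be computed directly from the tabulated values. The snippet below uses only the numbers in the table; the dictionary keys are shorthand labels introduced here.

```python
import math

ground_truth = (0.0, -100.0)
estimates = {
    "epipolar minimization": (0.5, -98.8),
    "Phi(t), no segmentation": (2.4, -96.7),
    "Phi(t), incl. segmentation": (0.0, -103.6),
}

# Euclidean distance of each estimated epipole from the known location,
# in the same units as the table.
errors = {
    name: math.hypot(x - ground_truth[0], y - ground_truth[1])
    for name, (x, y) in estimates.items()
}
```

The resulting distances are about 1.3, 4.1 and 3.6 respectively, so all three estimates lie within a few units of the known epipole.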



Experiment 3. The lab sequence contained several significant depth disconti- 
nuities and for the majority of frames our method performed better than the 
epipolar minimization. No ground truth was available, but we visually inspected 
the instantaneous scene depth recovered. Out of a 90 frame subsequence tested 
with both methods, the epipolar minimization yielded 25 frames with clearly 
incorrect depth (i.e., many negative depth estimates or reversed depth order 
for large parts of the scene). The performance of our method was significantly 
better, as only 7 frames yielded clearly incorrect depth. 

Experiment 4. The lab sequence was taken by a hand-held Panasonic D5000 
camera with a zoom setting of approximately 12 mm. Unfortunately, the effective 
focal length of the pinhole camera model was also influenced by the focus setting 
and we thus knew the intrinsic parameters only approximately. The internal 
parameters were fixed and approximately: f_x = f_y = 450, Δ_x = Δ_y = s = 0. 
The focal lengths were slightly overestimated, but consistent for different parts 
of the sequence. Calibration results are summarized in Table 2. 

Fig. 3. Depth edges (drawn in black), overlaid on the computed depth. 
(a) Correct translation (397, -115). (b) Incorrect translation (-80, 115). 



frames     f_x    f_y    Δ_x    Δ_y    s
001-300    536    522    16     26     3
001-100    541    543    -33    6      -25
101-200    544    475    26     -38    14
201-300    548    513    -11    8      6



Table 2. Self-calibration results for the lab sequence. 



6 Conclusions 

There exists a lot of structure in the world. With regard to shape, this structure 
manifests itself in surface patches that are smooth, separated by abrupt discon- 
tinuities. This paper exploited this fact in order to provide an algorithm that 
estimates 3D motion while at the same time it recovers scene discontinuities. 

The basis of the technique lies in the understanding of the interaction be- 
tween 3D shape and motion. Wrong 3D motion estimates give rise to depth 
values that are locally unsmooth, i.e., they vary more than the correct ones. 
This was exploited in order to obtain the 3D motion that locally provides the 
“smoothest” depth while recovering scene discontinuities. Finally, it was shown 
that the technique is very much related to epipolar minimization since the func- 
tion to be minimized in image areas corresponding to smooth scene patches takes 
the same values as deviation from the epipolar constraint. 

Although the properties of the new constraint are not fully developed, it 
appears to have a different behavior from the epipolar constraint for image 
areas corresponding to non-smooth patches, apparently causing less confusion 
between translation and rotation and thus producing more accurate results. It 
was generally believed in the computer vision community that a good algorithm 
for structure from motion is one that does not involve the scene in view. This is 
probably the reason behind the prominence of the epipolar constraint where one 
recovers the viewing geometry from image flow or correspondences without any 
reference to the scene. The present paper demonstrates that this may not be 
the case. The recovery of epipolar geometry is inherently connected to the scene 
in view and the best ways to achieve structure from motion may be the 
algorithms that recover motion and structure in a synergistic manner. Further study 
of the introduced constraint opens new avenues for research on the problem of 
recovering structure from image sequences, and its associated applications. 

Abstract. In this paper we present our global approach to accurate 3D 
reconstruction with a calibrated multi-camera system. In particular, we illustrate 
a simple and effective adaptive technique for the self-calibration of CCD-based 
multi-camera acquisition systems. We also propose a general and robust 
approach to the problem of close-range partial 3D reconstruction of objects 
from stereo-correspondences. Finally, we introduce a method for performing an 
accurate patchworking of the partial reconstructions, based on 3D feature 
matching. 

1. Introduction 

In the past few years, there has been a fast proliferation of methods for the 3D 
reconstruction of objects from the analysis of camera images. A large number of these 
applications are aimed at the problem of content creation for the market of multi- 
media applications. There is a considerable number of applications, however, in 
which the accuracy of the 3D reconstruction plays a crucial role. For example, 
applications of close-range digital photogrammetry aimed at the preservation and 
restoration of 3D works of art require effective methods for accurate, quantitative, 
reproducible and repeatable 3D reconstruction. In this case, in fact, suitable 3D 
modeling methods should be accurate enough to match the performance of the 
methods that are commonly adopted for the 3D relief of works of art, and to guarantee 
that such measurements will be reproducible and can be repeated over time for 
monitoring purposes. 

The most popular non-invasive approaches to 3D reconstruction of mid-sized objects 
are based on stereo-correspondences. Such methods are based on the detection of 
features (e.g. points, edges, luminance profiles) on the available images of the object. 
When the camera parameters (position, orientation and other intrinsic physical 
parameters) are known (calibrated case), the process of determining the 
correspondences is helped by some rigidity constraints such as the coplanarity of 
corresponding visual rays (epipolar constraint), and the 3D coordinates of the features 
can be determined through geometric triangulation [1,2]. When, on the contrary, the 
camera parameters are not known (uncalibrated analysis), the determination of the 
feature correspondences becomes more complex as it can only rely on projective 
constraints and invariants. Several robust matching techniques have been developed 
for uncalibrated acquisitions [19]; such methods are usually based on the progressive 
application of a variety of projective constraints on sets of uncalibrated views, in 
order to narrow down a list of candidate matches (generated through a 
correlation-based approach) to a final set of high-confidence matches. 

In general, the 3D reconstruction methods based on feature matching can be classified 
into two categories: 

• monocular approach: a number of uncalibrated views are acquired and 
processed all together (global approach) or in subgroups (local approach) in 
order to jointly estimate camera motion and object structure. In the global 
approach, one or more cameras are employed for acquiring a number of images 
of the object from a variety of viewpoints [6]. The pose of the cameras and the 
3D coordinates of the features are found through a joint analysis of the image 
features extracted from all the available views. In the local approach a video 
sequence of the object is acquired in such a way to “cover” all portions of the 
object. Then the views are partitioned into subgroups to be processed separately 
using uncalibrated methods based on projective invariants and constraints. 

• multi-ocular approach: a set of cameras is mounted on a rigid support and 
calibrated, so that all camera parameters are known beforehand. Several multi- 
ocular views of the object are acquired from a variety of viewpoints. From the 
analysis of each multi-view a “local” surface is generated. All local surfaces are 
then fused together into a single one, by using some global constraints [6,7,8]. 

In general, the global monocular approach estimates the 3D coordinates of some 
object features with best accuracy. Due to its global treatment of the data, however, it 
tends to produce a sparse set of 3D features that cannot be easily interpolated into a 
global surface unless some a-priori information on the object is available. Partitioning 
the views into smaller groups for a more “local” approach makes it easier to deal with 
the complexity of the surface topology but generally causes a reduction of the 
accuracy and is quite difficult to perform on an automatic basis. This partitioning 
becomes necessary when dealing with video sequences, but the subgroups of views 
tend to be “aligned” with each other, which is not the optimal positioning for feature 
matching purposes. On the other hand, acquiring a monocular sequence is certainly 
the simplest way to perform an acquisition campaign. 

The local multi-camera approach exhibits some interesting characteristics: 

• a multi-camera acquisition system induces a “natural” partition of the views, 
which becomes optimal when the cameras are well-positioned on the rigid frame; 

• the acquisition system can be quite easily calibrated, and the estimated 
parameters can be used for validating all feature matches; the calibration can be 
made adaptive in order to compensate for the drift of the parameters throughout 
the acquisition process; 

• the accuracy of a well-calibrated system is at photogrammetric level; 

• each calibrated triplet generates a “local” surface patch of modest topological 
complexity; all patches can be merged into a more complex global surface 
through “patchworking”. 

In this article we illustrate our calibrated reconstruction approach based on adaptive 
self-calibration, local stereo matching and global patchworking, with the 
goal of obtaining a high-accuracy reconstruction of unstructured 3D objects. 

2. Calibration 

All calibrated 3D reconstruction methods are critically dependent on the accuracy 
with which the camera parameters, i.e. the geometrical, optical and electric 
characteristics of the camera system (camera position and orientation, focal length, 
pixel size, location of the optical center, nonlinear distortion coefficients, etc.) are 
known. In the past few years several approaches to the calibration problem have been 
proposed. Such methods apply to electronic cameras the same techniques that were 
traditionally used for the calibration of photogrammetric cameras [9,10,11]. The 
camera characteristics are, in fact, computed through a proper processing of the image 
of a test object (calibration target-frame) placed in the scene. The accuracy of the 
camera model can be arbitrarily improved by employing an adequate number of 
parameters; therefore, when the goal is to improve the calibration accuracy as 
much as possible, the pattern accuracy becomes the major bottleneck. For this reason, 
we developed an advanced photogrammetric method that jointly estimates the camera 
parameters and the geometry of the calibration target-set in a more accurate fashion 
(self-calibration). This method is based on a multi-camera, multi-view calibration 
approach, and performs an accurate self-calibration on the multi-camera system from 
the analysis of several views of a simpler calibration target-frame, such as a marked 
planar surface (a printed sheet of paper glued on a glass surface) or some other even 
simpler structure. In fact, not only is this technique able to estimate the camera 
parameters, but it can also determine the 3D position of the targets on the calibration 
frame, which can be just roughly known or not known at all. Finally, we developed 
a method for making the calibration robust against the inevitable parameter drift that 
takes place during the acquisition process. This method detects and tracks some 
“safe” features that are naturally present in the scene, and uses their image coordinates 
to make the calibration process adaptive. 

2.1 Calibration Strategy 

The camera model allows us to compute the image coordinates of the projection of an 
object point P, given its coordinates. This model can also be seen as a function g that 
maps a point m in the model space into a point d in the observation space. This set of 
equations is called the direct model. For each fiducial point P_i and for each camera C_j, the 
direct model provides us with image coordinates in the form 

$$\mathbf{d}_{ij} = g(P_i, C_j) = g(\mathbf{m}_{ij}); \qquad \mathbf{d}_{ij} = [x_{ij}, y_{ij}]; \qquad \mathbf{m}_{ij} = [P_i, C_j]$$

where C_j is the set of the 11 parameters which define the model of the j-th camera 

$$C_j = [\phi, \theta, \psi, t_x, t_y, t_z, f, k_3, k_5, c_x, c_y],$$

φ, θ and ψ are the 3 angles that characterize the rotation matrix R, T = [t_x, t_y, t_z] is the 
translation vector, f is the focal length of the lens, k_3 and k_5 are the radial lens 
distortion coefficients, OC = [c_x, c_y] is the optical center on the image plane. R and T 
are usually referred to as extrinsic parameters, while the other five are called intrinsic 
parameters, as they characterize the camera. Rewriting the above equations in matrix 
form and extending them to all the considered fiducial points and all the cameras, the 
direct model becomes 

$$g(\cdot): \mathbf{m} \mapsto \mathbf{d} = g(\mathbf{m})$$

g(·) being the nonlinear function that maps the 3D world coordinates of N fiducial 
points, given the camera parameters of V cameras, into the V sets of N 
two-dimensional image coordinates of the perspective projection in each camera. 
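
A direct model with 11 parameters per camera could look like the sketch below. Several details are assumptions made for illustration and are not stated in the text: the order and axes of the three rotation angles, the form of the radial distortion polynomial, the names `k3` and `k5` for the two distortion coefficients, and the function names themselves.

```python
import math


def rotation_matrix(phi, theta, psi):
    """R = Rz(psi) * Ry(theta) * Rx(phi); this angle convention is an assumption."""
    cx, sx = math.cos(phi), math.sin(phi)
    cy, sy = math.cos(theta), math.sin(theta)
    cz, sz = math.cos(psi), math.sin(psi)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def direct_model(point, camera):
    """Project a 3D point with an 11-parameter camera; returns (x, y)."""
    phi, theta, psi, tx, ty, tz, f, k3, k5, cx, cy = camera
    rot = rotation_matrix(phi, theta, psi)
    # Rigid motion into the camera frame: R * P + T.
    xc, yc, zc = (
        sum(rot[i][j] * point[j] for j in range(3)) + t
        for i, t in enumerate((tx, ty, tz))
    )
    xu, yu = f * xc / zc, f * yc / zc        # ideal pinhole projection
    r2 = xu * xu + yu * yu
    scale = 1.0 + k3 * r2 + k5 * r2 * r2     # assumed radial distortion polynomial
    return cx + scale * xu, cy + scale * yu


def stacked_model(points, cameras):
    """d = g(m): all points seen by all cameras, flattened into one list."""
    return [coord for cam in cameras for p in points for coord in direct_model(p, cam)]


# Hypothetical camera: no rotation or translation, f = 500, center (320, 240).
ideal_camera = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 500.0, 0.0, 0.0, 320.0, 240.0)
image_point = direct_model((1.0, 2.0, 10.0), ideal_camera)
```

For the ideal camera above, the point (1, 2, 10) projects to (370, 340). Calibration then amounts to finding the camera parameters for which `stacked_model` reproduces the observed image coordinates as closely as possible.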

With this formalization of the camera model, the problem of camera calibration 
becomes that of computing the model vector m by exploiting the knowledge about the 
observed data d and the direct model g(·). In other words, the solution of the problem 
is given by the inverse of the model function, m = g⁻¹(d), where d and g(·) are known. 
This corresponds to the classical formulation of an inverse problem. In this sense, the 
camera calibration problem is a typical inverse problem. 

Multi-view, multi-camera calibration - It is well-known that, in order to obtain 
satisfactory and reliable calibration results, it is necessary for the fiducial points to 
“fill up” the entire scene space. This fact usually forces one to construct 3D calibration 
patterns that are as large as the object to be reconstructed. In this case, calibration is 
possible only for scenes of limited spatial extension. When scenes of larger size are 
considered, a 3D pattern cannot be employed anymore. In fact, an accurate 3D pattern 
is very expensive to build and quite cumbersome to handle. For this reason a multi- 
view calibration set-up has been developed, that allows us to generate the desired 3D 
set of fiducial points through multiple acquisition of a smaller and simpler calibration 
pattern, such as a planar target. In fact, the pattern can be placed in several different 
positions, so that the entire 3D volume of interest will be “scanned”. The relative 
motion between different positions of the pattern is not measured, therefore the 
position of the fiducial points in the scene is only partially known. This, of course, 
complicates the calibration problem, as the relative motion of the pattern must be 
estimated. In fact, we have six extra unknowns per additional target position. 
However, as the motion of the pattern from view to view is the same for all the 
cameras, each camera will give its contribution to the estimation of the pattern 
motion. In fact, with respect to the case of calibrating one camera with V pattern 
views, each new camera to calibrate adds 11 more unknowns, while providing approx. 
2NV equations (the image coordinates of NV fiducial points). 
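
The counting argument can be made concrete with a few lines of arithmetic. The sketch below assumes that every fiducial point is visible in every view of every camera and that the pattern geometry itself is known, so the only unknowns are the 11 parameters per camera and the 6 motion parameters per additional pattern position; the function name and the example numbers are hypothetical.

```python
def calibration_budget(num_cameras, num_views, num_points):
    """Return (unknowns, equations) for multi-view, multi-camera calibration."""
    # 11 parameters per camera, 6 per pattern position after the first one.
    unknowns = 11 * num_cameras + 6 * (num_views - 1)
    # Two image coordinates per point, per view, per camera.
    equations = 2 * num_points * num_views * num_cameras
    return unknowns, equations


three_cameras = calibration_budget(num_cameras=3, num_views=10, num_points=50)
four_cameras = calibration_budget(num_cameras=4, num_views=10, num_points=50)
```

With 10 pattern positions of 50 points each, three cameras give 87 unknowns against 3000 equations; adding a fourth camera adds 11 unknowns and 1000 equations, which is why additional cameras help the estimation of the pattern motion.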

Self-calibration - Thanks to the large number of constraints and to the fact that, 
through multiple pattern positioning, the fiducial points end up covering the whole 
scene space, the self-calibrating approach can lead to the best results that can be 
obtained with such low-cost calibration setups, in terms of global accuracy throughout 
the scene space. In fact, experimental results have shown that, under proper 
conditions, the achieved calibration accuracy, with this approach, reaches the limit 
imposed by the accuracy with which the position of the fiducial points is known, with 
respect to the pattern frame. In order to further improve the accuracy of the 
calibration, it is either necessary to use calibration patterns of higher precision, for 
which the fiducial point coordinates have been determined with high accuracy (e.g. 
with photogrammetric techniques) or to adopt a self-calibrating approach, which, 
besides estimating the camera parameters, refines the estimates of the a-priori given 
coordinates of the fiducial points. The complexity of the former solution is the same 
as in the previous case, but it requires expensive calibration patterns. As the aim of 
this work is to obtain high performance at low cost, we focused on the latter solution. 
The self-calibration problem is much more underdetermined than the previously 
considered one, because the calibration point coordinates WP are considered only 
approximately known. In other words, also the data points become, to some extent, 
unknowns to be estimated. The a-priori knowledge about the data generally consists 
of a rough estimate of the world-coordinates. The proposed technique is able not only 
to further improve the accuracy of the estimated camera parameters, but to refine the 
a-priori given estimate of the world coordinates as well. 



Fig. 1. General scheme for the multi-view multi-camera approach to self-calibration 



The calibration target-set that we adopted is planar as the pixel size is assumed known 
[9]. A planar target-set is much simpler to build compared to a 3D target-frame as it 
can be easily constructed, for example, by gluing a laser-printed sheet of paper on a 
rigid planar surface. This procedure also gives us some a-priori information on the 
coordinates of the targets (and their uncertainty), relative to a frame attached to the 
surface. A 3D calibration target-frame, on the other hand, would require an accurate 
3D measurement of the coordinates of the targets (generally through some 
photogrammetric technique [11]). An example of application of our self-calibration 
approach is reported in Fig. 2. 

Fig. 2. A-priori coordinates of the fiducial points of the target-set (laser-printed circles on a 
sheet of A4 paper, glued to a flat surface) and corresponding a-posteriori corrections 
estimated through self-calibration. The orientation of the (magnified) correction vectors 
denotes the deformation of the sheet of paper due to the action of the dragging mechanism 
of the laser printer. 

2.2 Adaptive Calibration 

In order to extract 3D information from the scene views the camera parameters must 
be known with good accuracy throughout the whole acquisition campaign. As camera 
calibration is performed before the beginning of an acquisition session, problems of 
parameter drift may occur. In fact, when long video sequences are acquired, the 
stability of the camera parameters measured at the beginning becomes a crucial 
problem as mechanical shocks, vibrations or thermal effects on cameras and supports, 
can cause small variations of the initial camera set-up. This drift of the camera 
parameters leads to significant 3D reconstruction errors, as the 3D back-projection is 
rather ill-conditioned with respect to the camera parameters. In order to overcome this 
problem, we detect and track any changes in the acquisition system and, whenever 
possible, we apply an on-the-fly correction of the camera parameters. By doing so, the 
calibration holds accurate throughout the acquisition campaign. 

Our approach does not require us to place targets in the scene or to use any a-priori
knowledge, but exploits luminance features that are already present in the scene (e.g.
corners and spots) and that can be located in the image with high precision. After the
localization process, which is performed with sub-pixel accuracy, a matching
operation is performed among the n sets (n being the number of cameras) of feature
points, which returns a set of n-tuples of homologous points. The matched n-tuples
are then back-projected into the 3D scene space. If the camera parameters change,
the back-projection will be affected by errors larger than the predicted
pre-calibration accuracy. A proper analysis of the magnitude and the temporal
changes of the back-projection error allows us to reveal and characterize any
incidental modifications of the camera parameters. Furthermore, if the set of matched
n-tuples is informative enough, the proposed technique makes it possible to accurately
measure the modification that has occurred and, therefore, to re-calibrate the system.
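
The back-projection check described above can be illustrated with a minimal sketch. It assumes pinhole cameras described by 3x4 projection matrices and uses plain linear (DLT) triangulation followed by re-projection; the function names, the camera values and the choice of DLT are illustrative assumptions, not the procedure actually used by the authors.

```python
import numpy as np

def triangulate(projections, pixels):
    """Linear (DLT) triangulation of one n-tuple of homologous image points."""
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                      # null vector of the stacked constraints
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def reprojection_errors(projections, pixels, X):
    """Pixel distance between each observed point and the re-projected 3D point."""
    return np.array([np.linalg.norm(project(P, X) - np.asarray(uv))
                     for P, uv in zip(projections, pixels)])

def camera(f, cx, cy, R, t):
    """Hypothetical pinhole camera K [R | t] with square pixels."""
    K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])
    return K @ np.hstack([R, np.reshape(t, (3, 1))])

# Three hypothetical cameras (a trinocular set-up) and one scene point.
cams = [camera(800, 320, 240, np.eye(3), [0, 0, 0]),
        camera(800, 320, 240, np.eye(3), [-0.2, 0, 0]),
        camera(800, 320, 240, np.eye(3), [0, -0.2, 0])]
X_true = np.array([0.1, -0.05, 1.2])
obs = [project(P, X_true) for P in cams]
X_est = triangulate(cams, obs)
err_ok = reprojection_errors(cams, obs, X_est).max()

# Simulated drift: the third camera's focal length has changed, but the
# stale calibration is still used for back-projection.
drifted = camera(820, 320, 240, np.eye(3), [0, -0.2, 0])
obs_drift = obs[:2] + [project(drifted, X_true)]
X_bad = triangulate(cams, obs_drift)
err_drift = reprojection_errors(cams, obs_drift, X_bad).max()
```

With a consistent calibration the re-projection error is numerically zero, while the stale calibration leaves a residual of the order of a pixel that no 3D point can absorb; this is the kind of quantity the text monitors over time.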

Our approach can be seen as composed of two main steps:

• check on the validity of the current camera parameters through the estimation of
the back-projection’s accuracy

• analysis of the temporal changes of the back-projection’s accuracy, in order to
reveal increments in the reconstruction error that could likely denote a change in
the system parameters

The first step of the algorithm consists in the detection of the significant image
features that will be used as control points. Our method is based on the techniques
presented in [13,14,15]. In order to obtain super-resolution in the image localization
accuracy, an algorithm for the local modeling of the image Laplacian function has
been developed and employed in the localization procedure. The obtained results
show that the introduced improvements make it possible to reach a localization
accuracy better than 0.2 pixel [18].

Over the obtained sets of image points, an n-partite matching algorithm is applied, in
order to find the stereo-corresponding n-tuples. The matching criterion is based not
only on the epipolar geometry defined by the current calibration, as the calibration is
not considered reliable in this application, but also on the similarity of the local
luminance profiles. All the matched n-tuples are then back-projected into the 3D scene
space, and an “accuracy index” is computed for each match, based on the
back-projection error. The statistical distribution of this index over the matched points and
its temporal behavior are then analyzed, in order to reveal any increment of the
accuracy index that could very likely denote a change in the system parameters.
Moreover, at the beginning of the sequence, the back-projected points that are most
accurate and are fixed in the scene are selected as control points. These are the points
that could then be used as 3D fiducial points for the re-calibration of the system. In
fact, if the number of matched points is sufficient, it is possible to perform a reliable
re-calibration of the system. When a change in the camera system has been detected,
the current set of matched n-tuples of image features is exploited, in order to recover
the new camera parameters.
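
As a rough illustration of the temporal analysis of the accuracy index, the sketch below flags a parameter change when the per-frame median of the index stays well above its initial level for several consecutive frames. The median statistic, the thresholds `factor` and `persistence`, and the synthetic data are all assumptions made for the example; the text does not specify the statistical test that is actually applied.

```python
import numpy as np

def detect_drift(index_per_frame, n_baseline=5, factor=3.0, persistence=3):
    """Return the first frame at which the median accuracy index stays above
    `factor` times its baseline for `persistence` consecutive frames,
    or None if no such change is found."""
    medians = np.array([np.median(frame) for frame in index_per_frame])
    baseline = np.median(medians[:n_baseline])   # level right after calibration
    run = 0
    for k in range(n_baseline, len(medians)):
        run = run + 1 if medians[k] > factor * baseline else 0
        if run == persistence:
            return k - persistence + 1
    return None

# Synthetic accuracy indices: 20 frames with a stable calibration,
# then 10 frames in which every back-projection error has grown.
rng = np.random.default_rng(0)
frames = [np.abs(rng.normal(0.0, 0.2, size=50)) for _ in range(20)]
frames += [np.abs(rng.normal(0.0, 0.2, size=50)) + 1.5 for _ in range(10)]
first_bad = detect_drift(frames)
stable = detect_drift(frames[:20])
```

Requiring the increase to persist over a few frames is one simple way to avoid reacting to isolated mismatches; a robust statistic such as the median serves the same purpose within a single frame.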

Assuming that the camera system is not subjected to a rigid motion with respect to the 
scene throughout the acquisition session, at the beginning of the sequence the most 
accurate and stable (fixed) back-projected points are detected and used as control 
points. These are the candidate points to be used as 3D targets for parameter 
correction, provided that their number is sufficient. When a parameter change is 
detected, the current set of matched n-tuples of image features is used for recovering
the new camera parameters. 

Depending on the previous knowledge of the 3D position of the matched points, the 
algorithm adopts either a calibration or a self-calibration approach. More precisely, if 
the 3D position of some points had been measured at the beginning of the sequence 
when the system was still calibrated, then re-calibration is performed through a 
standard procedure that uses the available 3D points as markers. If, on the contrary, 
no reliable information is available about the actual 3D position of the matched 
points, the calibration can only be corrected through a self-calibration procedure.
Self-calibration makes it possible to simultaneously determine the camera parameters
and the 3D position of the fiducial points.

This method, however, requires a larger number of matched points for accurate 
results, as the self-calibration problem is much more ill-conditioned than the 
calibration problem. We are currently working on a modified version of the method 
that is able to determine a rigid (rather than fixed) set of points and perform 
calibration with respect to a relative (rather than absolute) frame. 

The proposed technique has been tested on real sequences acquired with different 
trinocular camera systems, with both simulated and real variations of the camera 
parameters. In all experimental situations, the algorithm has been able to detect the 
modification of the camera parameters. Moreover, after artificial modifications of the
camera system, of the same kind and magnitude as accidental ones (artificial
shocks, change of focal length, etc.), the algorithm has been able to measure the drift
of the parameters, thus allowing the re-calibration of the system. The results have
shown that the accuracy of the re-calibration, in all cases, matched that of the
original calibration.

3. Partial Reconstruction 

Our approach to local reconstruction is based on feature correspondence. Image 
features that are most often used for 3D reconstruction are points, luminance edges 
and luminance profiles. Such features tend to provide information of different nature. 
Point and edge matching is generally a very precise and reliable process, but it 
usually results in a sparse set of 3D data. Conversely, matching the luminance profiles
of small areas tends to generate a much denser set of 3D data, but it is rather sensitive
to the unavoidable viewer-dependent perspective/radiometric distortions, so
this approach tends to be less stable and reliable. For this reason we developed a
general and robust solution to the problem of 3D reconstruction from stereo
correspondence of luminance patches. The method is largely independent of the
camera geometry, and employs a calibrated set of three or more standard TV- 
resolution CCD cameras, which provides enough redundancy for removing possible 
matching ambiguities. The robustness of the approach can also be attributed to the 
physicality of the matching process, which is actually performed in the 3D space 
rather than on the image plane. In order to do so, besides the 3D location of the 
surface patches, it estimates their local orientation in 3D space as well, so that the 
geometric distortion of the luminance patch can be included in the model. Finally, the 
method takes into account the viewer-dependent radiometric distortion. 

3.1 Edge-Based Approach 

As a preliminary operation, we perform partial reconstruction from edge matching, in 
order to obtain reliable and accurate 3D data to begin with. The same type of features 
will later be used for egomotion estimation as well (which is based on 3D contour 
matching in object space). In order to be able to use edges for accurate egomotion 
estimation, however, we need to detect them with great accuracy. We do this by first
using a traditional edge detector; we then retrieve the subpixel location of the edge
points through an interpolation process which takes the luminance gradient into
account. Finally, a rule-based contour tracking method is employed for determining
the correct connection between edge points.

The search for homologous edges on different views is performed along epipolar 
lines. Notice that using more than two cameras allows us to avoid problems of 
matching ambiguity. With three cameras, in fact, not only can we always select the 
best pair of views for a specific stereo-correspondence (sharp intersection between 
edge and epipolar lines), but we can validate the match through a check on the third 
view. In fact, the edge point must lie on the intersection of the two epipolar lines
associated with the homologous edge points on the other views. Once the matches are
found, each set of corresponding contours is back-projected onto the 3D scene space 
by looking for the point at minimum distance from the three homologous visual rays. 
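
The last step, finding the point at minimum distance from the homologous visual rays, has a closed-form least-squares solution. The sketch below assumes each ray is given by a camera centre and a direction; the function name and the numerical values are hypothetical.

```python
import numpy as np

def closest_point_to_rays(origins, directions):
    """Point minimising the sum of squared distances to a set of 3D lines."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = np.asarray(d, float) / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)   # projector onto the plane normal to the ray
        A += M
        b += M @ np.asarray(o, float)
    return np.linalg.solve(A, b)         # solvable unless all rays are parallel

# Three hypothetical optical centres looking at the same scene point.
centres = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.15, 0.25, 0.0]])
target = np.array([0.1, 0.05, 1.2])
rays = target - centres
point = closest_point_to_rays(centres, rays)
```

When the three rays do not meet exactly, the same normal equations return the point that balances the three distances, and the residual distances give a natural measure of the quality of the match.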

3.2 Area-Based Approach 

The luminance patches used by most area-matching techniques are normally assumed 
to have the same shape in all views. It is quite clear, however, that this hypothesis is 
acceptable only when the angles between the viewing directions of the three cameras 
are not too wide, which is not our case. As a consequence, we need to take into 
account the perspective distortion of the shape of the patch, when back-projected onto 
the object surface and then re-projected onto the other image plane. In order to do so, 
we assume the 3D surface to be locally flat, which means that it can be approximated 
by a plane within the back-projected surface patch. 
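
Under the local-flatness assumption, the mapping of a patch from one view to another is a plane-induced homography. The sketch below uses the standard pinhole formulation, with the plane written as n·X = d in the frame of the first camera and the second camera related to it by X2 = R X1 + t. This parameterization, the function names and all numerical values are assumptions made for illustration and need not coincide with the one adopted by the authors.

```python
import numpy as np

def plane_homography(K1, K2, R, t, n, d):
    """Homography mapping view-1 pixels to view-2 pixels for points lying on
    the plane n.X = d (camera-1 frame), with camera 2 given by X2 = R X1 + t."""
    return K2 @ (R + np.outer(t, n) / d) @ np.linalg.inv(K1)

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# Hypothetical intrinsics and a strongly converging second view.
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
R, t = rot_y(np.deg2rad(25)), np.array([-0.5, 0.0, 0.1])
n = np.array([0.2, -0.1, 1.0])
n /= np.linalg.norm(n)                   # unit normal of the tangent plane
d = 1.2                                  # distance of the plane from camera 1
H = plane_homography(K, K, R, t, n, d)

# A 3D point on the plane, projected directly and through the homography.
X1 = np.array([0.1, -0.05, 0.0])
X1[2] = (d - n[:2] @ X1[:2]) / n[2]
u1 = K @ X1
u2 = K @ (R @ X1 + t)
u2_h = H @ u1
```

Because the homography depends on the plane normal as well as on its distance, searching over both position and orientation of the patch amounts to searching over a family of such warps.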

In the other view we search, along the epipolar line (which is curved by radial
distortion), for the patch that best matches the first one. The projective distortion of
the patch is accounted for by estimating both position and orientation of the patch. In
practice, the minimum of a similarity function between a patch of the actual image and
a re-projected patch after perspective warping is searched for as a function of position
and local orientation of the tangent plane of the object surface.

Area matching requires a comparison between the actual luminance profile of a patch
and the one that we obtain by transferring luminance profiles of other views through
a specific 3D surface model. Let S be a surface patch in object space, obtained by
back-projecting a reference image patch of any of the views onto the plane $s^T x = 0$,
and let $S^{(i)}$ be its i-th view. The transfer of projective coordinates from the j-th
view to the i-th view through the plane $s^T x = 0$ can be expressed as a homography
(an invertible linear projective transformation) of the form

$u^{(i)} = M_{ij}(s)\, u^{(j)}$,

where $M_{ij}(s)$ is a 3 by 3 matrix which depends on the parameters of the plane over
which the patch lies. This homography allows us to express the luminance transfer
from the j-th view to the i-th view as an affine relation between the two luminance
profiles, with a multiplicative correction factor (gain) that accounts for electrical
differences in the camera sensors and an additive radiometric correction (offset) which
accounts for non-Lambertian effects of the surface reflectivity (migration of reflections
with the viewpoint). Notice that the Lambertian component of the surface reflectivity
does not appear in the above expression as it is the same for all views.
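
The gain and offset of the luminance transfer can be estimated jointly with the match. As a minimal sketch, assuming the patch of the j-th view has already been warped onto the pixel grid of the i-th view, a linear least-squares fit gives both corrections and a residual that can serve as a dissimilarity score. The function name, the use of an RMS residual and the synthetic patches are assumptions of this example.

```python
import numpy as np

def fit_gain_offset(reference, transferred):
    """Least-squares gain g and offset o such that reference ~ g*transferred + o.
    Also returns the RMS residual, usable as a dissimilarity measure."""
    x = np.asarray(transferred, float).ravel()
    y = np.asarray(reference, float).ravel()
    A = np.column_stack([x, np.ones_like(x)])
    (g, o), *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - (g * x + o)
    return g, o, float(np.sqrt(np.mean(residual ** 2)))

# Synthetic 9x9 patches related by a pure gain/offset change.
rng = np.random.default_rng(1)
patch_j = rng.uniform(20, 200, size=(9, 9))
patch_i = 1.1 * patch_j + 7.0
g, o, rms = fit_gain_offset(patch_i, patch_j)
```

Fitting the two radiometric terms before comparing the patches keeps a purely photometric difference between cameras from being mistaken for a geometric mismatch.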

If a reference patch produces reliable 3D information, then it can be used for 3D
surface reconstruction. Once all reference regions have been considered, surface 
interpolation is carried out and the area matching process can start over with a smaller 
patch size. In this case the previously estimated surface can be used for initializing the 
search in the next step and speeding up the process. 

As a general rule, we need to make sure that the maximum size of the patch is small
enough to guarantee a limited matching error. On the other hand, we know that the
area matching process is based on the minimization of a highly nonlinear similarity
function, so we can expect the process to be quite sensitive to local minima. In
order to avoid such a problem, we can use an initial guess of the surface shape, which
helps the minimization process converge to a global minimum and dramatically
speeds up the matching process by reducing the size of the search space. In principle,
any method can be used for obtaining the initial surface. In our case we used the
surface obtained through edge-matching [1,2], whose reliability is guaranteed by the
accuracy of the camera model and the calibration procedure. As the result of the
edge-matching is usually a sparse, though accurate, set of 3D points, such data are
interpolated in order to generate the initial surface. We interpolated the 3D data by
means of a modified and optimized version of the edge-preserving Discrete Smooth
Interpolator (DSI) [16].

Fig. 3. Original views of the newspaper’s page (glued to a planar surface) and 3D points
reconstructed through area matching.

Experiments of 3D scene reconstruction have been carried out on several test
scenes. The first test presented in this paper concerns the measurement of the
accuracy of the area matching using a flat object placed at a distance of about 1.2 m
from the camera system. The reconstructed surface turned out to be flat to within a
standard deviation of 0.1 mm (see Fig. 3).

A second experiment concerned the 3D reconstruction of a tele-conferencing scene.
The acquisition was made with a trinocular camera system at CCETT, France, within
the ACTS “PANORAMA” Project (see Fig. 4). No initial reconstruction was used for
area matching. Instead, progressive-scan initialization was performed. The results of
the area matching procedure are visible in Fig. 4. As we can see, the quality of the
reconstruction greatly benefits from the fact that the geometric distortion is included
in the model.




Fig. 4. One of the three original views of a conference scene (above) and two virtual views of 
the 3D points reconstructed through 3D area matching (below). 



4. Motion Estimation through Line Matching 

In order to be able to merge 3D data coming from different partial reconstructions, we 
need to accurately estimate the rigid motion that the acquisition system undergoes 
between two multi-view acquisitions. In order to do so, one could employ high- 
precision mechanical devices for positioning the camera system (or the object) before 
acquiring a multi-view. This a-priori solution of the ego-motion problem, however, is 
usually quite expensive and not very flexible. Alternatively, one can perform
detection and tracking of some image features throughout the acquisition process, and 
use the location of such features for estimating the camera motion. This last approach 
becomes particularly interesting when the features to be extracted are part of the scene 
to be reconstructed rather than being artificially added to it. Adding special markers 
to the imaged scene is, in fact, common practice in photogrammetry but, besides 
making the egomotion retrieval more invasive, it requires a certain expertise and 
slows down the acquisition process [8]. Conversely, natural point-like features that 
are already present in the scene are difficult to safely extract and accurately locate. 
Scene features that can be quite safely detected are, instead, luminance edges [6]. 
These features are more likely to be naturally present in the scene and rather easy to 
detect, which makes them good candidate features for egomotion estimation. 

Our method is based on the analysis of 3D contours in the imaged scene. Having 
adopted a calibrated multi-ocular camera system [9,11], the estimation is entirely 
performed in the 3D space. In fact, all edges of each one of the multi-views are 
previously localized, matched and back-projected onto the object space [12]. Roughly 
speaking, the method searches for the rigid motion that best merges the sets of 3D 
edges that are extracted from each one of the multiple views. 

After partitioning the 3D contours in lines and curves, we proceed as follows:

1. rough egomotion estimation from straight contours:

• matching of straight contours 

• motion estimation through minimization of the distance between 
homologous contours 

2. egomotion refinement using curved contours: 

• matching of curved contours 

• motion estimation through a minimization of the distance between 
homologous curved contours. 

Notice that, as a first approximation of the egomotion is already available, the 
matching of curved contours is a rather simple operation compared with the matching 
of straight lines. 



4.1 Egomotion from Straight Lines 

Line matching in 3D space is performed through a hypothesize-and-test type of 
procedure [17]. The first step of this method consists of formulating hypotheses on 
the possible couplings by selecting all those that do not violate some rules of 
congruence based on a set of geometrical constraints. By doing so, we drastically 
reduce the search space over which to test for matching correctness. At this point we 
can proceed with an exhaustive search through the above reduced set of hypotheses 
and select the match that maximizes a matching quality index. 

Once the matching process is complete, the egomotion estimation can be performed
rather easily by searching for the rigid motion that minimizes an appropriate merging
cost function between two sets of 3D lines that pertain to two different partial
reconstructions. Notice that 3D contours are generally reconstructed as chains of
segments whose length and fragmentation may vary quite drastically from multi-view
to multi-view. We thus proceed by first determining the 3D line portions that best fit
(through linear regression) the chains of fragments of edges that have been recognized
as straight. Then, instead of measuring the distances between extremal points of two
segments, we measure the distance between the extremal points of one segment and
the line that the other segment lies upon (see Fig. 5a).




Fig. 5. Evaluation of the merging cost of two straight 3D contours (a). 3D curve matching: 
evaluation of the distance between two polylines (b). 
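
Such distances are used for defining the merging cost as follows

$C_s = \sum_{i=1}^{N} \left[ (d_i')^2 + (d_i'')^2 \right]$,

where $d_i'$ and $d_i''$ are the distances of the two extremal points of the i-th
segment from the line that its homologous segment lies upon, and N is the number of
matched pairs. In fact, the orientation of edges is usually less sensitive to
fragmentation problems than their location in the 3D space [1,17].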
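
A minimal sketch of this merging cost follows. It fits the supporting line of a chain of edge fragments with a total-least-squares fit (via SVD), which is one possible reading of the linear regression mentioned in the text, and then sums the squared distances of the end points of the homologous segment from that line. Function names and data are hypothetical.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares 3D line through a chain of edge points:
    returns a point on the line (the centroid) and a unit direction."""
    pts = np.asarray(points, float)
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c)
    return c, vt[0]

def point_line_distance(p, c, d):
    v = np.asarray(p, float) - c
    return np.linalg.norm(v - (v @ d) * d)

def merging_cost(pairs):
    """Sum, over matched pairs, of the squared distances of the two end points
    of one segment from the line supporting its homologous chain."""
    cost = 0.0
    for (end_a, end_b), chain in pairs:
        c, d = fit_line(chain)
        cost += point_line_distance(end_a, c, d) ** 2
        cost += point_line_distance(end_b, c, d) ** 2
    return cost

# A fragmented straight contour along the x axis, and a homologous segment
# seen in another multi-view, before and after sliding it along the contour.
chain = np.array([[0, 0, 0], [0.5, 0, 0], [2, 0, 0], [3, 0, 0]], float)
seg = (np.array([0.0, 0.01, 0.0]), np.array([1.0, 0.02, 0.0]))
seg_slid = (np.array([1.5, 0.01, 0.0]), np.array([2.5, 0.02, 0.0]))
cost = merging_cost([(seg, chain)])
cost_slid = merging_cost([(seg_slid, chain)])
```

The two costs coincide: sliding or shortening a segment along its own contour does not change its distance from the supporting line, which is why this measure tolerates the fragmentation differences between multi-views.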

4.2 Egomotion Refinement from Curved Contours 

As already said above, curved contours are used for improving the accuracy of the
egomotion estimate. A matching process is still required but it is much simpler as
we already have a first approximation of the camera motion, determined from straight
edges. In fact, applying the pre-determined rigid motion to the set of curved edges, we
can decide whether two curved edges are matched, depending on their global
distance, which can be measured, with reference to Fig. 5b, as

$d_c = \frac{1}{2}\left[ d(C, C') + d(C', C) \right]$,  $d(C, C') = \frac{1}{N}\sum_{i=1}^{N} d(E_i, C')$,

where $E_i$, i = 1, ..., N, are the points of the polyline C and $d(E_i, C')$ is the
distance of $E_i$ from the polyline C'.

The global cost function for motion refinement is of the form $C = C_s + k\,C_c$, where
$C_s$ and $C_c$ are the merging costs associated with straight and curved contours,
respectively, and k is a weight for balancing the two contributions.
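
The distance between two curved contours can be sketched as a symmetric mean of point-to-polyline distances. The averaging over vertices and the function names are assumptions of this example; segments of zero length are not handled.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Distance of point p from the segment with end points a and b."""
    ab = b - a
    s = np.clip((p - a) @ ab / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + s * ab))

def directed_distance(C, D):
    """Mean distance of the vertices of polyline C from polyline D."""
    return np.mean([min(point_segment_distance(p, D[k], D[k + 1])
                        for k in range(len(D) - 1)) for p in C])

def polyline_distance(C, D):
    """Symmetric distance between two polylines."""
    C, D = np.asarray(C, float), np.asarray(D, float)
    return 0.5 * (directed_distance(C, D) + directed_distance(D, C))

# Two hypothetical 3D contours, one a shifted copy of the other.
C = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], float)
D = C + np.array([0.0, 0.1, 0.0])
dist = polyline_distance(C, D)
same = polyline_distance(C, C)
```

Averaging the two directed distances makes the measure independent of which contour is taken as reference, so it can be thresholded to accept or reject a candidate match and then summed over the accepted pairs to form the curved-contour cost.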

4.3 Examples of Application 

The method has been extensively tested against convergence problems and has been 
applied to a series of trinocular acquisitions of real images in order to evaluate 
qualitatively and quantitatively the accuracy of the results and the speed of 
convergence. Furthermore, the performance of the proposed method has been 
compared with that of a previously studied method [2,8] based on point 
correspondences between artificially added markers. Quantitative results have been 
obtained by measuring the maximum thickness of the bundles of edges when 
superimposing different sets of them with the estimated motion parameters. The 
performance of the proposed method has been proven to be equal to or better than that 
of the point-based approach, resulting in a maximum bundle size of about 100 ppm in 
all tests (after merging all 3D edges coming from 20 multi-views). 

In Fig. 6, the results on 3D data merging are reported for an object of complex shape, 
in both cases of egomotion estimated through point and line correspondences. In the 
first case the cost function is a rigidity constraint based on the distance between 
reconstructed 3D points of different 3D data sets. Such points are markers that have 
been artificially added to the scene (white dots placed on the object’s support). In the 
second case the egomotion is computed with the method proposed in this paper. Even 
though no artificially added markers have been used for the estimation, the accuracy 
of the estimate is comparable with that obtained through point-matching. 

Fig. 6. One of the original views of the object, fusion of all 3D edge sets through 3D point
correspondences (added marks), fusion of all 3D edge sets through 3D contour matching
(natural features).

5. Conclusions

In this paper we presented our global approach to accurate 3D reconstruction with a
calibrated multi-camera system. In particular, we presented a simple and effective
technique for calibrating CCD-based multi-camera acquisition systems. The proposed
method proved capable of highly accurate results even when using very
simple calibration target-sets (with little or no a-priori information on them) and
low-cost imaging devices, such as standard TV-resolution cameras connected to
commercial frame-grabbers. We also showed our approach to adaptive calibration, which
proved effective for keeping track of camera parameter drift through natural feature
tracking. We also proposed and illustrated a general and robust approach to the problem
of close-range partial 3D reconstruction of objects from stereo-correspondences. The
method is independent of the geometry of the acquisition system, which could be a set
of n cameras with strongly converging optical axes. The robustness of the approach
can be mainly attributed to the physicality of the matching process, which is virtually
performed in the 3D space. In fact, both 3D location and local orientation of the
surface patches are estimated, so that the geometric distortion can be accounted for.
The method takes into account the viewer-dependent radiometric distortion as well.
Finally, we presented a method for accurately merging the partial
reconstructions through 3D feature matching. The method, based on the best fusion
of 3D curves, provides very accurate results even when using standard TV-resolution
CCD cameras.

Abstract. Modeling of 3D objects from image sequences is one of the
challenging problems in computer vision and has been a research topic for
many years. Important theoretical and algorithmic results were achieved
that allow even complex 3D scene models to be extracted from images. One
recent effort has been to reduce the amount of calibration and to avoid
restrictions on the camera motion. In this contribution an approach is
described which achieves this goal by combining state-of-the-art algorithms
for uncalibrated projective reconstruction, self-calibration and dense
correspondence matching.



1 Introduction 

Obtaining 3D models from objects is an ongoing research topic in computer 
vision. A few years ago the main applications were robot guidance and visual 
inspection. Nowadays however the emphasis is shifting. There is more and more 
demand for 3D models in computer graphics, virtual reality and communication. 
This results in a change in emphasis for the requirements. The visual quality 
becomes one of the main points of attention. 

The acquisition conditions and the technical expertise of the users in these 
new application domains can often not be matched with the requirements of 
existing systems. These require intricate calibration procedures every time the 
system is used. There is an important demand for flexibility in acquisition.
Calibration procedures should be absent or restricted to a minimum.

Additionally, the existing systems are often built around specialized
hardware (e.g. laser range finders or stereo rigs) resulting in a high cost for these
systems. Many new applications however require robust low cost acquisition
systems. This stimulates the use of consumer photo- or video cameras.

In this paper we present a system which retrieves a 3D surface model from a 
sequence of images taken with off-the-shelf consumer cameras. The user acquires 
the images by freely moving the camera around the object. Neither the camera 
motion nor the camera settings have to be known. The obtained 3D model is 
a scaled version of the original object (i.e. a metric reconstruction), and the 
surface albedo is obtained from the image sequence as well. 

Other researchers have presented systems for extracting 3D shape and texture 
from image sequences acquired with a freely moving camera. The approach of 
Tomasi and Kanade used an affine factorization method to extract 3D from
image sequences. An important restriction of this system is the assumption of 
orthographic projection. 

Another type of system starts from an approximate 3D model and camera
poses and refines the model based on images (e.g. Facade proposed by Debevec
et al.). The advantage is that fewer images are required. On the other hand,
a preliminary model must be available and the geometry should not be too
complex.



Our system uses full perspective cameras and does not require prior models.
It combines state-of-the-art algorithms of different domains: projective
reconstruction, self-calibration and dense depth estimation.

Projective Reconstruction: It has been shown by Faugeras and Hartley
that a reconstruction up to an arbitrary projective transformation is
possible from an uncalibrated image sequence. Since then a lot of effort has been
put into reliably obtaining accurate estimates of the projective calibration of an
image sequence. Robust algorithms were proposed to estimate the fundamental
matrix from image pairs. Based on this, an algorithm which sequentially
retrieves the projective calibration of a complete image sequence has been
developed. A more recent version based on the trifocal tensor has also been
presented.



Self-Calibration: Since a projective calibration is not sufficient for many
applications, researchers tried to find ways to automatically upgrade projective
calibrations to metric (i.e. Euclidean up to scale). Typically, it is assumed that
the same camera is used throughout the sequence and that the intrinsic camera
parameters are constant. This proved a difficult problem and many researchers
have worked on it. One of the main problems is that
critical motion sequences exist for which self-calibration does not result in a
unique solution. We proposed a more pragmatic approach which
assumes that some parameters are (approximately) known but which allows
others to vary. Therefore this approach can deal with zooming/focusing cameras.
Others have proposed similar approaches.

Dense Depth Estimation: Since the calibration of the image sequence
has been estimated, we can use stereoscopic triangulation techniques between
image correspondences to estimate depth. The difficult part in stereoscopic depth
estimation is to find dense correspondence maps between the images. The
correspondence problem is facilitated by exploiting constraints derived from the
calibration and from some assumptions about the scene. We use an approach
that combines local image correlation methods with a dynamic programming
approach to constrain the correspondence search. This technique was first
proposed by Gimel'farb and further developed by others.

The rest of the paper is organized as follows: In section 2 a general overview
of the system is given. In the subsequent sections the different steps are explained
in more detail: projective reconstruction, self-calibration, dense matching and
model generation. The final section concludes the paper.

2 Overview of the Method

The presented system gradually retrieves more information about the scene and 
the camera setup. The first step is to relate the different images. This is done 
pairwise by retrieving the epipolar geometry. An initial reconstruction is then 
made for the first two images of the sequence. For the subsequent images the 
camera pose is estimated in the projective frame defined by the first two cameras. 
For every additional image that is processed at this stage, the interest points
corresponding to points in previous images are reconstructed, refined or corrected.
Therefore it is not necessary that the initial points stay visible throughout the
entire sequence. The result of this step is a reconstruction of typically a few
hundred interest points. The reconstruction is only determined up to a projective
transformation.

The next step is to restrict the ambiguity of the reconstruction to a metric
one. In a projective reconstruction not only the scene, but also the camera
is distorted. Since the algorithm deals with unknown scenes, it has no way of
identifying this distortion in the reconstruction. Although the camera is also
assumed to be unknown, some constraints on the intrinsic camera parameters (e.g.
rectangular or square pixels, constant aspect ratio, principal point in the middle
of the image, ...) can often still be assumed. A distortion of the camera mostly
results in the violation of one or more of these constraints. A metric
reconstruction/calibration is obtained by transforming the projective reconstruction
until all the constraints on the camera's intrinsic parameters are satisfied.
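
The idea that a non-metric distortion shows up as a violation of the intrinsic constraints can be checked numerically. The sketch below extracts the intrinsic matrix from a 3x4 projection matrix by an RQ decomposition and measures the skew and the deviation from square pixels. It only illustrates the principle under a hypothetical camera and an affine distortion of the reconstruction; it is not the self-calibration algorithm of the paper.

```python
import numpy as np

def rq3(M):
    """RQ decomposition of a 3x3 matrix: M = R Q with R upper triangular
    (positive diagonal) and Q orthogonal, built from NumPy's QR."""
    P = np.flipud(np.eye(3))
    Q_, R_ = np.linalg.qr((P @ M).T)
    R, Q = P @ R_.T @ P, P @ Q_.T
    S = np.diag(np.sign(np.diag(R)))
    return R @ S, S @ Q

def intrinsics(P_cam):
    """Intrinsic matrix of a 3x4 projection matrix, normalised so K[2,2] = 1."""
    K, _ = rq3(P_cam[:, :3])
    return K / K[2, 2]

def constraint_violation(K):
    """Relative skew and deviation of the pixel aspect ratio from 1."""
    return abs(K[0, 1]) / K[0, 0], abs(K[1, 1] / K[0, 0] - 1.0)

# Hypothetical camera with square pixels and zero skew.
K_true = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
a = np.deg2rad(20)
R = np.array([[np.cos(a), 0, np.sin(a)], [0, 1, 0], [-np.sin(a), 0, np.cos(a)]])
P_metric = K_true @ np.hstack([R, [[0.1], [0.0], [2.0]]])

# The same camera expressed in a distorted (non-metric) frame: P' = P T.
T = np.diag([1.0, 1.3, 1.0, 1.0])
P_distorted = P_metric @ T

skew_m, aspect_m = constraint_violation(intrinsics(P_metric))
skew_d, aspect_d = constraint_violation(intrinsics(P_distorted))
```

In the metric frame both constraints hold, whereas the stretched frame yields an apparent aspect ratio of 1.3; self-calibration searches for the transformation that removes such violations for all cameras of the sequence.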

At this point the system effectively has a calibrated image sequence at its
disposal. The relative position and orientation of the camera is known for all the
viewpoints. This calibration facilitates the search for corresponding points and
allows us to use a stereo algorithm that was developed for a calibrated system. This
step makes it possible to find correspondences for most of the pixels in the images.

From these correspondences the distance from the points to the camera center 
can be obtained through triangulation. These results are refined and completed 
by combining the correspondences from multiple images. 

A dense metric 3D surface model is obtained by approximating the depth 
map with a triangular wire frame. The texture is obtained from the images and 
mapped onto the surface. 
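
A minimal sketch of the last step, turning a depth map into a triangular mesh with texture coordinates, is given below. It assumes that the depth map stores the depth along the optical axis and that the intrinsic matrix K is known; the regular two-triangles-per-cell tessellation and the function name are illustrative assumptions.

```python
import numpy as np

def depth_map_to_mesh(depth, K, step=1):
    """Back-project a depth map to 3D vertices and connect neighbouring
    samples with two triangles per grid cell."""
    rows = np.arange(0, depth.shape[0], step)
    cols = np.arange(0, depth.shape[1], step)
    uu, vv = np.meshgrid(cols, rows)
    z = depth[vv, uu]
    pix = np.stack([uu, vv, np.ones_like(uu)], axis=-1).reshape(-1, 3).T
    verts = (np.linalg.inv(K) @ pix).T * z.reshape(-1, 1)
    h, w = len(rows), len(cols)
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    uv = np.stack([uu.ravel(), vv.ravel()], 1)   # texture coordinates in pixels
    return verts, faces, uv

# A tiny hypothetical depth map of a fronto-parallel plane at depth 2.
K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
depth = np.full((4, 5), 2.0)
verts, faces, uv = depth_map_to_mesh(depth, K)
```

Since every vertex comes from a pixel, the pixel position itself serves as the texture coordinate, which is what allows the image to be mapped directly onto the surface. A larger `step` gives a coarser wire frame.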

Figure 1 gives an overview of the system. It consists of independent
modules which pass on the necessary information to the next modules. The 
first module computes the projective calibration of the sequence together with 
a sparse reconstruction. In the next module the metric calibration is computed 
from the projective camera matrices through self-calibration. Then dense corre- 
spondence maps are estimated. Finally all results are integrated in a textured 
3D surface reconstruction of the scene under consideration. 

Throughout the rest of the paper the different steps of the method will be 
explained in more detail. An image sequence of the Arenberg castle in Leuven 
will be used for illustration. Some of the images of this sequence can be seen in 
Figure 2. The full sequence consists of 24 images recorded with a video camera.

Fig. 1. Overview of the system: from the image sequence ($I(x, y)$) the projective
reconstruction is computed; the projection matrices $P$ are then passed on to the
self-calibration module which delivers a metric calibration $P_M$; the next module
uses these to compute dense depth maps $D(x, y)$; all these results are assembled
in the last module to yield a textured 3D surface model. On the right side 
the results of the different modules are shown: the preliminary reconstructions 
(both projective and metric) are represented by point clouds, the cameras are 
represented by little pyramids, the results of the dense matching are accumulated 
in dense depth maps (light means close and dark means far). 

Fig. 2. Some images of the Arenberg castle sequence. This sequence is used
throughout the paper to illustrate the different steps of the reconstruction
system.



2.1 Notations 



In this section some notations used in this paper are introduced. A detailed
explanation of the basic concepts can be found in [?]. Projective geometry and
homogeneous coordinates are used throughout this paper. Metric entities are
indicated with a subscript $M$.

The following equation is used to describe the perspective projection of the
scene onto the images:

$$m \propto P M \tag{1}$$

where $P$ is a $3 \times 4$ projection matrix describing the perspective
projection process, and $M = [X\ Y\ Z\ 1]^T$ and $m = [x\ y\ 1]^T$ are vectors
containing the homogeneous coordinates of the world points and image points
respectively. Note that $\propto$ will be used throughout this paper to indicate
equality up to a non-zero scale factor. Indexes $i$ and $j$ will be used for
points (e.g. $M_i$), indexes $k$ and $l$ for views (e.g. $P_k$).

In the metric case the camera projection matrix factorizes as follows:

$$P_M = K [\, R^T \mid -R^T t \,] \tag{2}$$

Here $(R, t)$ denotes a rigid transformation (i.e. $R$ is a rotation matrix and
$t$ is a translation vector) which indicates the position and orientation of the
camera, while the upper triangular calibration matrix $K$ encodes the intrinsic
parameters of the camera:

$$K = \begin{bmatrix} f_x & s & u_x \\ 0 & f_y & u_y \\ 0 & 0 & 1 \end{bmatrix} \tag{3}$$

where $f_x$ and $f_y$ represent the focal length divided by the pixel width and
height respectively, $(u_x, u_y)$ represents the principal point and $s$ is a
factor which is zero for rectangular pixels.

The following notations are used for the epipolar geometry: $F_{kl}$ is the
fundamental matrix for views $k$ and $l$, and $e_{kl}$ is the epipole
corresponding to this fundamental matrix in view $l$.

3 Projective Reconstruction

At first the images are completely unrelated. The only assumption is that the 
images form a sequence in which consecutive images do not differ too much. 
Therefore the local neighborhood of image points originating from the same 
scene point should look similar if images are close in the sequence. This allows 
for automatic matching algorithms to retrieve correspondences. 



3.1 Relating the Images 

It is not feasible to compare every pixel of one image with every pixel of the 
next image. It is therefore necessary to reduce the combinatorial complexity. 
In addition not all points are equally well suited for automatic matching. The 
local neighborhoods of some points contain a lot of intensity variation and are 
therefore easy to differentiate from others. An interest point detector (i.e. the
Harris corner detector [?]) is used to select a certain number of such suitable
points. These points should be well located and indicate salient features that
stay visible in consecutive images. Correspondences between these image points 
need to be established through a matching procedure. 

Matches are determined through normalized cross-correlation of the intensity 
values of the local neighborhood. Since the images are assumed not to differ too
much, corresponding points can be expected to lie in the same region of the
image. Therefore at first only interest points which have similar positions
are considered for matching. When two points are mutual best matches they are 
considered as potential correspondences. 
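
The matching rule described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the way patches and positions are passed in, and the `max_dist` search radius are all assumptions made for the example.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized intensity patches."""
    a = a.astype(float).ravel() - a.mean()
    b = b.astype(float).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def mutual_best_matches(patches1, pos1, patches2, pos2, max_dist=30.0):
    """Return index pairs (i, j) that are each other's best NCC match.

    Only interest points whose image positions differ by at most max_dist
    pixels (a hypothetical radius) are compared.
    """
    n1, n2 = len(patches1), len(patches2)
    score = np.full((n1, n2), -np.inf)
    for i in range(n1):
        for j in range(n2):
            if np.linalg.norm(np.subtract(pos1[i], pos2[j])) <= max_dist:
                score[i, j] = ncc(patches1[i], patches2[j])
    best12 = score.argmax(axis=1)  # best candidate in image 2 for each point of image 1
    best21 = score.argmax(axis=0)  # best candidate in image 1 for each point of image 2
    return [(i, int(j)) for i, j in enumerate(best12)
            if best21[j] == i and np.isfinite(score[i, j])]
```

Because the correlation is normalized, a gain or offset change in intensity between the two images does not affect the score.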

Since the epipolar geometry describes the complete geometry relating two 
views, this is what should be retrieved. Computing it from the set of potential 
matches through least squares does in general not give satisfying results due to 
its sensitivity to outliers. Therefore a robust approach should be used. Several
techniques have been proposed based on robust statistics [?]. Our system
incorporates the RANSAC (RANdom SAmple Consensus) approach used by
Torr [?]. Table 1 sketches this technique.



• repeat
    - take minimal sample (7 matches)
    - compute F
    - estimate %inliers
  until P_OK(%inliers, #trials) > 95%
• refine F (using all inliers)

Table 1. Robust estimation of the epipolar geometry from a set of matches
containing outliers using RANSAC (P_OK indicates the probability that the
epipolar geometry has been correctly estimated).
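
The stopping rule in Table 1 can be illustrated with the usual RANSAC argument. The sketch below assumes that P_OK is the probability that at least one of the minimal samples drawn so far contained only inliers; the text does not spell out the exact expression, so this formula and the function names are assumptions.

```python
import math

def p_ok(inlier_fraction, trials, sample_size=7):
    """Probability that at least one of `trials` samples is outlier-free."""
    p_good_sample = inlier_fraction ** sample_size
    return 1.0 - (1.0 - p_good_sample) ** trials

def trials_needed(inlier_fraction, confidence=0.95, sample_size=7):
    """Smallest number of trials for which p_ok reaches `confidence`."""
    p_good_sample = inlier_fraction ** sample_size
    if p_good_sample <= 0.0:
        return math.inf
    if p_good_sample >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))
```

Since the inlier fraction is re-estimated in every iteration, the required number of trials typically drops sharply once a good sample has been found.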

Once the epipolar geometry has been retrieved, one can start looking for
more matches to refine this geometry. In this case the search region is restricted 
to a few pixels around the epipolar lines. 



3.2 Initial Reconstruction 



The two first images of the sequence are used to determine a reference frame.
The world frame is aligned with the first camera. The second camera is chosen
so that the epipolar geometry corresponds to the retrieved $F_{12}$ (see [?]):

$$\begin{aligned} P_1 &= [\, I_{3 \times 3} \mid 0 \,] \\ P_2 &= [\, [e_{12}]_\times F_{12} + e_{12} a^T \mid a_4 e_{12} \,] \end{aligned} \tag{4}$$

where $[e_{12}]_\times$ indicates the vector product with $e_{12}$. Equation (4) is
not completely determined by the epipolar geometry (i.e. $F_{12}$ and $e_{12}$),
but has 4 more degrees of freedom (i.e. $a_i$, $i = 1 \ldots 4$).
$a = [a_1\ a_2\ a_3]^T$ determines the position of the plane at infinity and $a_4$
determines the global scale of the reconstruction. To avoid some problems during
the reconstruction it is recommended to determine $a$ in such a way that the
plane at infinity does not cross the scene. Our implementation follows the
quasi-Euclidean approach proposed in [?], but an alternative would be to use
Hartley's cheirality [?] or oriented projective geometry [?]. Since there is no
way to determine the global scale from the images, $a_4$ can arbitrarily be set
to $a_4 = 1$.
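
As an illustration of Equation (4), the following sketch builds the two cameras from a given fundamental matrix and checks that they reproduce it. It assumes the convention $m_2^T F_{12} m_1 = 0$, so that the epipole in the second view is the left null vector of $F_{12}$; the function names are hypothetical.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def cameras_from_F(F, a=(0.0, 0.0, 0.0), a4=1.0):
    """Build P1 = [I | 0] and P2 = [[e]_x F + e a^T | a4 e] as in Equation (4)."""
    U, _, _ = np.linalg.svd(F)
    e = U[:, -1]                          # left null vector: e^T F = 0
    a = np.asarray(a, dtype=float)
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([skew(e) @ F + np.outer(e, a), a4 * e[:, None]])
    return P1, P2

def F_from_cameras(P1, P2):
    """Fundamental matrix of a pair with P1 = [I | 0] and P2 = [A | b]: F = [b]_x A."""
    return skew(P2[:, 3]) @ P2[:, :3]
```

Any choice of $a$ and of a non-zero $a_4$ yields the same fundamental matrix up to scale, which is why these four parameters are not fixed by the epipolar geometry.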

Once the cameras have been fully determined the matches can be reconstructed
through triangulation. The optimal method for this is given in [?].
This gives us a preliminary reconstruction.
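
For illustration only, a simple linear triangulation is sketched below. It is not the optimal method referred to above, which is not reproduced here; it minimizes an algebraic error instead. The function name and the use of inhomogeneous image coordinates are assumptions.

```python
import numpy as np

def triangulate_linear(P1, P2, m1, m2):
    """Linear triangulation of one match.

    P1, P2: 3x4 projection matrices; m1, m2: image points (x, y).
    Returns the homogeneous 3D point scaled so that its last coordinate is 1.
    """
    A = np.vstack([m1[0] * P1[2] - P1[0],
                   m1[1] * P1[2] - P1[1],
                   m2[0] * P2[2] - P2[0],
                   m2[1] * P2[2] - P2[1]])
    M = np.linalg.svd(A)[2][-1]   # right singular vector of the smallest singular value
    return M / M[3]
```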



3.3 Adding a View 

For every additional view the pose towards the pre-existing reconstruction is
determined, then the reconstruction is updated. This is illustrated in Figure 3.
The first step consists of finding the epipolar geometry as described in
Section 3.1. Then the matches which correspond to already reconstructed points
are used to compute the projection matrix $P_k$. This is done using a robust
procedure similar to the one laid out in Table 1. In this case a minimal sample
of 6 matches is needed to compute $P_k$. Once $P_k$ has been determined the
projection of already reconstructed points can be predicted. This makes it
possible to find some additional matches to refine the estimation of $P_k$. This
means that the search space is gradually reduced from the full image to the
epipolar line to the predicted projection of the point. This is illustrated in
Figure 4.
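
The computation of $P_k$ from a sample of 3D-to-2D matches can be sketched with the standard direct linear transformation. This is an assumed stand-in for the inner step of the robust procedure, not the authors' code; the function name is hypothetical.

```python
import numpy as np

def resection_dlt(Ms, ms):
    """Estimate a 3x4 projection matrix from at least 6 matches.

    Ms: homogeneous 3D points (n x 4); ms: homogeneous image points (n x 3).
    Each match contributes two linear equations in the 12 entries of P.
    """
    rows = []
    for M, m in zip(Ms, ms):
        x, y = m[0] / m[2], m[1] / m[2]
        rows.append(np.concatenate([M, np.zeros(4), -x * M]))
        rows.append(np.concatenate([np.zeros(4), M, -y * M]))
    A = np.array(rows)
    return np.linalg.svd(A)[2][-1].reshape(3, 4)   # defined up to scale
```

Six matches give twelve equations for the eleven degrees of freedom of $P_k$, which is consistent with the minimal sample size stated above.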

Once the camera projection matrix has been determined the reconstruction 
is updated. This consists of refining, correcting or deleting already reconstructed 
points and initializing new points for new matches. 

After this procedure has been repeated for all the images, camera poses are
available for all the views, together with the reconstruction of the interest
points. In the subsequent modules mainly the camera calibration is used. The
reconstruction
itself is used to obtain an estimate of the disparity range for the dense stereo 
matching. 

Fig. 3. Image matches $(m_{k-1}, m_k)$ are found as described before. Since the
image points, $m_{k-1}$, relate to object points, $M$, the pose for view $k$ can
be computed from the inferred matches $(M, m_k)$.




Fig. 4. (a) a priori search range, (b) search range along the epipolar line and (c) 
search range around the predicted position of the point. 



4 Self-Calibration 



The reconstruction obtained as described in the previous section is only
determined up to an arbitrary projective transformation. This might be sufficient
for some robotics or inspection applications, but certainly not for visualization. 
In this section a technique to restrict this ambiguity to metric is described. 

For a metric calibration the factorization of the camera projection matrices
as in Equation (2) yields the physical parameters of the camera. A necessary
condition for a metric reconstruction is therefore that constraints which exist on 
the intrinsic camera parameters are verified through this factorization. 

To apply the following method to standard zooming/focusing cameras, some 
assumptions should be made. Often it can be assumed that pixels are rectangular
or even square. If necessary (e.g. when only a short image sequence is at
hand, when the projective calibration is not accurate enough or when the motion
sequence is close to critical [?] without additional constraints), it can also
be assumed that the principal point is close to the center of the image.

For the actual computations the absolute conic $\omega$ is used. This is an
imaginary conic located in the plane at infinity $\Pi_\infty$. Both entities are
the only geometric entities which are invariant under all Euclidean
transformations. The plane at infinity and the absolute conic respectively
encode the affine and metric properties of space. This means that when the
position of $\Pi_\infty$ is known in a projective framework, affine invariants
can be measured. Since the absolute conic is invariant under Euclidean
transformations its image only depends on the intrinsic camera parameters (focal
length, ...) and not on the extrinsic camera parameters (camera pose). The
following equation applies for the dual image of the absolute conic:

$$\omega_k^* \propto K_k K_k^T \tag{5}$$

Therefore constraints on the intrinsic camera parameters are readily translated
to constraints on the dual image of the absolute conic. This image is obtained
from the absolute conic through the following projection equation:

$$\omega_k^* \propto P_k \Omega^* P_k^T \tag{6}$$

where $\Omega^*$ is the dual absolute quadric which encodes both the absolute
conic and its supporting plane, the plane at infinity. The constraints on
$\omega_k^*$ can therefore be back-projected through this equation. The result is
a set of constraints on the position of the absolute conic (and the plane at
infinity).
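
Equations (5) and (6) can be checked numerically. The sketch below assumes a metric camera of the form $K[R^T \mid -R^T t]$ (one reading of Equation (2)) and the metric-frame quadric $\Omega^* = \mathrm{diag}(1, 1, 1, 0)$; the function names and the numerical values are illustrative only.

```python
import numpy as np

def metric_camera(K, R, t):
    """Metric projection matrix K [R^T | -R^T t]."""
    return K @ np.hstack([R.T, -(R.T @ t).reshape(3, 1)])

def project_dual_quadric(P, Omega):
    """Dual image of the absolute conic, scaled so that its (3,3) entry is 1."""
    w = P @ Omega @ P.T
    return w / w[2, 2]
```

In the metric frame the projection returns $K K^T$ whatever the pose. If the cameras and the quadric are both mapped by a projective transformation $T$ (cameras by $P T^{-1}$, quadric by $T \Omega^* T^T$), the result is unchanged, which is what allows $\Omega^*$ to be located in a projective reconstruction.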

Our system first uses a linear method to obtain an approximate calibration.
This calibration is then refined through a non-linear optimization step in a second 
phase. 



4.1 Initial Calibration 



To obtain a linear algorithm some assumptions have to be made. If the pixels
are square and the principal point is in the middle of the image, the image can
be transformed to obtain the following intrinsic camera parameters:

$$K_k = \begin{bmatrix} f_k & 0 & 0 \\ 0 & f_k & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{7}$$

This simplifies Equation (6) as follows:

$$\lambda \begin{bmatrix} f_k^2 & 0 & 0 \\ 0 & f_k^2 & 0 \\ 0 & 0 & 1 \end{bmatrix} = P_k \begin{bmatrix} c_1 & c_2 & c_3 & c_4 \\ c_2 & c_5 & c_6 & c_7 \\ c_3 & c_6 & c_8 & c_9 \\ c_4 & c_7 & c_9 & c_{10} \end{bmatrix} P_k^T \tag{8}$$

with $\lambda$ an explicit scale factor. From the left-hand side of Eq. (8) it can
be seen that the following equations have to be satisfied:

$$\lambda \omega_k^{*(11)} = \lambda \omega_k^{*(22)} \tag{9}$$

$$\lambda \omega_k^{*(12)} = \lambda \omega_k^{*(13)} = \lambda \omega_k^{*(23)} = 0 \tag{10}$$

$$\lambda \omega_k^{*(21)} = \lambda \omega_k^{*(31)} = \lambda \omega_k^{*(32)} = 0 \tag{11}$$

with $\omega_k^{*(ij)}$ representing the element on row $i$ and column $j$ of
$\omega_k^*$. Note that due to symmetry (10) and (11) result in identical
equations. These constraints can thus be imposed on the right-hand side, yielding
4 independent linear equations in $c_i$, $i = 1 \ldots 10$ for every image:

$$\begin{aligned} P_k^{(1)} \Omega^* P_k^{(1)T} &= P_k^{(2)} \Omega^* P_k^{(2)T} \\ P_k^{(1)} \Omega^* P_k^{(2)T} &= 0 \\ P_k^{(1)} \Omega^* P_k^{(3)T} &= 0 \\ P_k^{(2)} \Omega^* P_k^{(3)T} &= 0 \end{aligned}$$

with $P_k^{(j)}$ representing row $j$ of $P_k$ and $\Omega^*$ parameterized as in
(8). The rank 3 constraint can be imposed by taking the closest rank 3
approximation (using SVD for example). This approach holds for sequences of 3 or
more images. The special case of 2 images can also be dealt with, but with a
slightly different approach. For more details see [?].
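
The linear step can be sketched as follows. The code stacks the four equations per image as rows of a homogeneous system in $c_1 \ldots c_{10}$, solves it by SVD and then enforces the rank 3 constraint. It is a minimal sketch under the assumptions stated above (square pixels, principal point at the origin of the transformed image); the function names are hypothetical and no data normalization is included.

```python
import numpy as np

# Position of c_1..c_10 (0-based) inside the symmetric 4x4 matrix of Eq. (8).
IDX = np.array([[0, 1, 2, 3],
                [1, 4, 5, 6],
                [2, 5, 7, 8],
                [3, 6, 8, 9]])

def quad_row(p, q):
    """Coefficients of the bilinear form p^T Omega q with respect to c_1..c_10."""
    row = np.zeros(10)
    for i in range(4):
        for j in range(4):
            row[IDX[i, j]] += p[i] * q[j]
    return row

def linear_self_calibration(Ps):
    """Estimate the dual absolute quadric (up to scale) from projective cameras."""
    rows = []
    for P in Ps:
        p1, p2, p3 = P                                      # rows of the projection matrix
        rows.append(quad_row(p1, p1) - quad_row(p2, p2))    # element (11) equals (22)
        rows.append(quad_row(p1, p2))                       # element (12) is zero
        rows.append(quad_row(p1, p3))                       # element (13) is zero
        rows.append(quad_row(p2, p3))                       # element (23) is zero
    c = np.linalg.svd(np.array(rows))[2][-1]                # null vector of the system
    Omega = c[IDX]
    U, s, Vt = np.linalg.svd(Omega)
    s[3] = 0.0                                              # closest rank 3 approximation
    return U @ np.diag(s) @ Vt
```

With three views the system has twelve equations for ten unknowns defined up to scale, in line with the statement that the approach holds for sequences of 3 or more images.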



4.2 Refined Calibration 



To refine the calibration Eq. (6) is used directly in a non-linear least squares
criterion. In this case the user is free to specify the constraints which should
be imposed. Every intrinsic parameter can be known, fixed or free. The dual image
absolute conics should be parameterized in such a way that these constraints
are enforced. For the absolute quadric $\Omega^*$ a parameterization should be
used which takes into account the symmetry and the rank 3 constraint. Since
$\Omega^*$ is only determined up to scale this leaves us with a minimum
parameterization of 8 parameters. This can be done by putting
$\Omega^*_{33} = 1$ and by calculating $\Omega^*_{44}$ from the rank 3 constraint.
The following parameterization satisfies these requirements:

$$\Omega^* = \begin{bmatrix} K K^T & -K K^T a \\ -a^T K K^T & a^T K K^T a \end{bmatrix} \tag{12}$$

Here $a$ defines the position of the plane at infinity
$\Pi_\infty = [a^T\ 1]^T$. In this case the transformation from projective to
metric is particularly simple:

$$T_{P \to M} = \begin{bmatrix} K^{-1} & 0 \\ a^T & 1 \end{bmatrix} \tag{13}$$

An approximate solution to these equations can be obtained through non-linear
least squares. The following criterion should be minimized ($\|\cdot\|_F$ is the
Frobenius norm):

$$\min \sum_{k=1}^{n} \left\| \frac{K_k K_k^T}{\|K_k K_k^T\|_F} - \frac{P_k \Omega^* P_k^T}{\|P_k \Omega^* P_k^T\|_F} \right\|_F^2 \tag{14}$$
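
The parameterization and the criterion can be written down compactly. The sketch below only evaluates the quantities in Equations (12) to (14); the non-linear optimizer itself is left out, and the function names are hypothetical.

```python
import numpy as np

def omega_from_K_a(K, a):
    """Dual absolute quadric parameterized by K and a as in Equation (12)."""
    KKt = K @ K.T
    a = np.asarray(a, dtype=float).reshape(3, 1)
    return np.block([[KKt, -KKt @ a],
                     [-a.T @ KKt, a.T @ KKt @ a]])

def T_proj_to_metric(K, a):
    """Projective-to-metric transformation of Equation (13)."""
    T = np.eye(4)
    T[:3, :3] = np.linalg.inv(K)
    T[3, :3] = np.asarray(a, dtype=float)
    return T

def calibration_cost(Ks, Ps, Omega):
    """Sum of squared Frobenius distances between normalized conics, Equation (14)."""
    cost = 0.0
    for K, P in zip(Ks, Ps):
        A = K @ K.T
        B = P @ Omega @ P.T
        cost += np.linalg.norm(A / np.linalg.norm(A) - B / np.linalg.norm(B)) ** 2
    return cost
```

Two properties are easy to verify with these helpers: $[a^T\ 1]^T$ is a null vector of $\Omega^*$, and $T_{P \to M}\, \Omega^*\, T_{P \to M}^T = \mathrm{diag}(1, 1, 1, 0)$, the form the quadric takes in a metric frame.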

5 Dense Depth Estimation

Only a few scene points are reconstructed from feature tracking. Obtaining a 
dense reconstruction could be achieved by interpolation, but in practice this does 
not yield satisfactory results. Small surface details would never be reconstructed 
in this way. Additionally, some important features are often missed during the 
corner matching and would therefore not appear in the reconstruction. 

These problems can be avoided by using algorithms which estimate corre- 
spondences for almost every point in the images. At this point algorithms can 
be used which were developed for calibrated stereo rigs. 



5.1 Rectification 

Since we have computed the calibration between successive image pairs we can
exploit the epipolar constraint that restricts the correspondence search to a 1-D
search range. It is possible to re-map the image pair to standard geometry with
the epipolar lines coinciding with the image scan lines [?]. The correspondence
search is then reduced to a matching of the image points along each image
scan line. This results in a dramatic increase of the computational efficiency of
the algorithms by enabling several optimizations in the computations. The
rectification procedure is illustrated in Figure 5. For some motions (i.e. when
the epipole is located in the image) standard rectification based on planar
homographies is not possible and a more advanced procedure should be used [?].




Fig. 5. Through the rectification process the image scan lines are brought into 
epipolar correspondence. This allows important gains in computational efficiency 
and simplification of the dense stereo matching algorithm. 

5.2 Dense Stereo Matching

In addition to the epipolar geometry other constraints like preserving the order
of neighboring pixels, bidirectional uniqueness of the match, and detection of
occlusions can be exploited. These constraints are used to guide the
correspondence towards the most probable scan-line match using a dynamic
programming scheme [?].

For dense correspondence matching a disparity estimator based on the dynamic
programming scheme of Cox et al. [?] is employed that incorporates the
above mentioned constraints. It operates on rectified image pairs $(I_k, I_l)$
where the epipolar lines coincide with image scan lines. The matcher searches at
each pixel in image $I_k$ for maximum normalized cross correlation in $I_l$ by
shifting a small measurement window (kernel size 5x5 to 7x7 pixels) along the
corresponding scan line. The selected search step size $\Delta D$ (usually 1
pixel) determines the search resolution. Matching ambiguities are resolved by
exploiting the ordering constraint in the dynamic programming approach [?]. The
algorithm was further adapted to employ extended neighborhood relationships and a
pyramidal estimation scheme to reliably deal with very large disparity ranges of
over 50% of image size [6].

5.3 Multiview Matching 

The pairwise disparity estimation makes it possible to compute image to image
correspondences between adjacent rectified image pairs, and independent depth
estimates for each camera viewpoint. An optimal joint estimate is achieved by
fusing all independent estimates into a common 3D model. The fusion can be
performed in an economical way through controlled correspondence linking. The
approach utilizes a flexible multi viewpoint scheme which combines the advantages
of small baseline and wide baseline stereo [?].

Assume an image sequence with $k = 1 \ldots n$ images. Starting from a reference
view point $k$ the correspondences between adjacent images
$(k+1, k+2, \ldots, n)$ and $(k-1, k-2, \ldots, 1)$ are linked in a chain. The
depth for each reference image point $m_k$ is computed from the correspondence
linking that delivers two lists of image correspondences relative to the
reference, one linking down from $k \to 1$ and one linking up from $k \to n$. For
each valid corresponding point pair $(x_k, x_i)$ we can triangulate a depth
estimate $d(x_k, x_i)$ along $S_{m_k}$ with $e_i$ representing the depth
uncertainty. The left part of Figure 6 visualizes the decreasing uncertainty
interval during linking.

While the disparity measurement resolution $\Delta D$ in the image is kept
constant (at 1 pixel), the reprojected depth error $e_i$ decreases with the
baseline. Outliers are detected by controlling the statistics of the depth
estimate computed from the correspondences. All depth values that fall within the
uncertainty interval around the mean depth estimate are treated as inliers. They
are fused by a 1-D Kalman filter to obtain an optimal mean depth estimate.
Outliers are undetected correspondence failures and may be arbitrarily large. As
threshold to detect the outliers we utilize the depth uncertainty interval $e_i$.
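
The fusion step can be illustrated with a scalar Kalman update. In the sketch below each linked view contributes a depth estimate and an uncertainty; a measurement is rejected when it lies further from the current mean than its own uncertainty interval. The exact gating rule is an assumption made for the example, as are the function and variable names.

```python
def fuse_depths(depths, sigmas):
    """Fuse depth estimates for one reference pixel with a 1-D Kalman filter.

    depths: depth estimates d(x_k, x_i) along the linking chain.
    sigmas: corresponding uncertainties e_i (treated as standard deviations).
    Returns the fused depth and its standard deviation.
    """
    d, var = depths[0], sigmas[0] ** 2
    for z, s in zip(depths[1:], sigmas[1:]):
        if abs(z - d) > s:            # outside the uncertainty interval: outlier
            continue
        gain = var / (var + s ** 2)   # Kalman gain for a static scalar state
        d = d + gain * (z - d)
        var = (1.0 - gain) * var
    return d, var ** 0.5
```

For the accepted measurements the result equals the inverse-variance weighted mean, so estimates obtained over a wider baseline, which have a smaller $e_i$, dominate the fused depth.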

The result of this procedure is a very dense depth map. Most occlusion problems
are avoided by linking correspondences from up and down the sequence.
An example of such a very dense depth map is given in Figure 6.




Fig. 6. Depth fusion and uncertainty reduction from correspondence linking 
(left), Resulting dense depth map (light means near and dark means far) (right). 



6 Building the Model 

The dense depth maps as computed by the correspondence linking must be 
approximated by a 3D surface representation suitable for visualization. So far 
each object point was treated independently. To achieve spatial coherence for 
a connected surface, the depth map is spatially interpolated using a paramet- 
ric surface model. The boundaries of the objects to be modeled are computed 
through depth segmentation. In a first step, an object is defined as a connected 
region in space. Simple morphological filtering removes spurious and very small 
regions. We then employ a bounded thin plate model with a second order spline 
to smooth the surface and to interpolate small surface gaps in regions that could 
not be measured. If the object consists of dominant planar regions, the local
surface normal may be exploited to segment the object into planar parts [?].

The spatially smoothed surface is then approximated by a triangular wire- 
frame mesh to reduce geometric complexity and to tailor the model to the re- 
quirements of Computer Graphics visualization systems. The mesh triangulation 
currently utilizes the reference view only to build the model. The surface fusion 
from different view points to completely close the models remains to be imple- 
mented. Sometimes it is not possible to obtain a single metric framework for 
large objects like buildings since one may not be able to record images contin- 
uously around it. In that case the different frameworks have to be registered to 
each other. This will be done using available surface registration schemes [?].

Texture mapping onto the wire-frame model greatly enhances the realism of 
the models. As texture map one could take the reference image texture alone and
map it to the surface model. However, this creates a bias towards the selected
image, and imaging artifacts like sensor noise, unwanted specular reflections or
the shading of the particular image are directly transferred onto the object. A
better choice is to fuse the texture from the image sequence in much the same
way as depth fusion.

The viewpoint linking builds a controlled chain of correspondences that can 
be used for texture enhancement as well. The estimation of a robust mean texture 
will capture the static object only and the artifacts (e.g. specular reflections or 
pedestrians passing in front of a building) are suppressed [?]. The texture fusion
could also be done on a finer grid, yielding a super resolution texture [?].

An example of the resulting model can be seen in Figure 7.




Fig. 7. 3D surface model obtained automatically from an uncalibrated image 
sequence, shaded (left), textured (right). 



7 Conclusion 

An automatic 3D scene modeling technique was discussed that is capable of 
building models from uncalibrated image sequences. The technique is able to 
extract metric 3D models without any prior knowledge about the scene or the 
camera. The calibration is obtained by assuming a rigid scene and some con- 
straints on the intrinsic camera parameters (e.g. square pixels). 

Work remains to be done to get more complete models by fusing the partial 
3D reconstructions. This will also increase the accuracy of the models and elim- 
inate artifacts at the occluding boundaries. For this we can rely on work already 
done for calibrated systems. 

Abstract. As virtual worlds demand ever more realistic 3D models,
attention is being focussed on systems that can acquire graphical models 
from real objects. This paper describes a system which, given a sequence 
of images of an object rotating about a single axis, generates a textured 
3D model fully automatically. In contrast to previous approaches, the 
technique described here requires no prior information about the cameras 
or scene, and does not require that the turntable angles be known (or 
even constant through the sequence). 

From an analysis of the projective geometry of the situation, it is shown
that the rotation angles may be determined unambiguously, and that 
camera calibration, camera positions and 3D structure may be deter- 
mined to within a two parameter family. An algorithm has been im- 
plemented to compute this reconstruction fully automatically. The two 
parameter reconstruction ambiguity may be removed by specifying, for 
example, camera aspect ratio and parallel scene lines. Examples are pre- 
sented on four turn-table sequences. 



1 Introduction 

Numerous graphics and computer vision papers have dealt with the construction 
of 3D solid models by volume intersection from multiple views. As pointed out
by Ponce [?], the idea dates back to Baumgart [2] in 1974. Well engineered
systems built on this idea have yielded 3D texture mapped graphical models
of impressive quality [?]. A good example is the system of Hannover [?]
where, as is usual for such systems, the object is rotated on a turntable against
a background which can easily be removed by image segmentation. Such systems
are generally completely calibrated, i.e. the camera internal parameters, rotation
angles, distance to the rotation axis etc. are all accurately known.

In this paper we develop the projective geometry of single axis rotation and 
describe its automatic and optimal estimation from an image sequence with no 
other a priori information supplied. It is shown that 3D structure and cameras 
can be estimated (including auto-calibration) up to an overall two-parameter 
ambiguity. The angle of rotation between views is not ambiguous. This geometry
is described in section 2 and an algorithm to automatically estimate this
geometry from an image sequence is given in section [?].

[Figure 1 panels: Hannover dinosaur [?], 36 frames; Cup, 36 frames; Freiburg, 36 frames]

Fig. 1. Some example sequences.



We then describe a modelling system based on this estimated geometry. The 
input is a turn-table image sequence, the output is the set of cameras and a 3D 
VRML texture mapped model of the object, with all processing automatic. Other 
than the estimation of the camera geometry, the system is much the same as the 
calibrated Hannover system, and involves: volume intersection; representation of 
the surface as a triangulated network; triangle grouping; and texture mapping. 
This is described in section [?]. The output models are of equal quality to those
of fully calibrated sequences, a fact demonstrated on a sequence supplied by
Hannover and shown in figure [?].

General uncalibrated multiple-view geometry. Before specializing to single axis
rotation, consider first the general case of reconstruction from multiple pinhole
cameras viewing a 3D scene [?]. 3D points $X$ in the scene are represented as
homogeneous 4-vectors $[X, Y, Z, 1]^T$, while their 2D projections $x$ are
represented as homogeneous 3-vectors $[x, y, 1]^T$. The action of each camera is
represented by a $3 \times 4$ projection matrix $P$:

$$x_{ij} = P_i X_j$$

The $m$ cameras are indicated by $P_i$ while the $n$ 3D points are $X_j$.

In the case where $m$ different cameras view a scene, there is no relationship
between the $P_i$. Therefore $11m$ parameters are required to specify all the
cameras. When the cameras have identical internal parameters, such as when a
camera is moved through a static scene without any change in focus or zoom,
the internal parameters are constant over the sequence. This reduces the number
of parameters required to specify the cameras from $11m$ to $6m + 5$.

Uncalibrated single-axis multiple-view geometry. In the single-axis case, we
shall see that the number of parameters is reduced to $m + 8$, and that
estimation is relatively straightforward. It is worth contrasting the reduction
in the number of parameters that occurs in this special motion case, with a
popular alternative which is to reduce the number of parameters by approximating
the perspective camera as a weak perspective camera [?]. Such "affine" cameras
are an approximation of the geometry, and under imaging conditions which
typically apply in close-range model acquisition, this approximation can be quite
poor. However, the advantage of this approximation is a simple, non-iterative
estimation algorithm [?]. In contrast, specializing the motion to single axis is
an exact model of the geometry, not an approximation, yet it admits a closed-form
solution. Previous investigations of turn-table sequences [?] have not fully
exploited the special motion to simplify camera recovery.
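
To make the three parameter counts concrete, they can be evaluated for $m = 36$ views, the length of the example sequences in Fig. 1. Reading $6m + 5$ as five shared internal parameters plus six pose parameters per view is the standard interpretation and is stated here as an assumption.

```latex
\begin{align*}
\text{general cameras:} \quad & 11m = 11 \cdot 36 = 396 \\
\text{constant internal parameters:} \quad & 6m + 5 = 6 \cdot 36 + 5 = 221 \\
\text{single-axis motion:} \quad & m + 8 = 36 + 8 = 44
\end{align*}
```

The single-axis model therefore needs roughly a fifth of the parameters of the constant-internals case for a sequence of this length.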



2 The Projective Geometry of Single Axis Motion 

A single axis motion consists of a set of Euclidean actions on the world such that 
the relative motion between the scene and camera can be described by rotations 
about a single fixed axis. In the language of screw decompositions, any Euclidean 
action can be decomposed as a rotation about a screw axis (which is parallel to 
the Euclidean rotation axis) together with a translation along the screw axis. In 
the case of single axis motions there is zero translation along the screw axis, and 
the screw axes of each Euclidean action coincide. 

There are many cases of this motion commonly occurring in computer vision 
applications. The most common, and the one that is used here, is the case of a 
static camera viewing an object rotating on a turntable. A second case is that 
of a camera rotating about a fixed axis. For example, imagine a QuickTime VR 
acquisition device where the camera is offset along its principal axis, so that 
it does not rotate about its centre. A third case is that of a camera viewing a 
rotating mirror. 

It will be helpful in the following to consider that the object is fixed and that 
the camera rotates about it. The camera internal parameters are fixed. To aid 
visualisation, we assume that the rotation axis is vertical, so that the camera 
rotates in a horizontal plane. 

We now describe the camera and image geometry arising from this con- 
strained motion, particularly the fixed entities of the motion, which play an 
important role. It will be seen that the fundamental matrix, F, trifocal tensor 
T and camera matrices P all have additional properties, and that the multiple 
view tensors (F and T) determine a two-parameter family of camera matrices.
This ambiguity is removed using internal and external constraints. 

2.1 3D Geometry 

Under a single axis rotation the camera centre describes a circle in a horizontal
plane $\pi_h$. The geometry is illustrated in figure 2. There are a number of
geometric entities which are fixed under this motion, including:

— The (vertical) rotation axis, denoted $L_s$ ("s" for "screw" axis). This is a
line of fixed points.

— The plane $\pi_h$, and indeed the pencil of horizontal planes. Each plane is
fixed as a set.




Fig. 2. 3D geometry. The cameras are indicated by their centres (spheres),
and image planes. The point $X_s$ is the intersection of the plane $\pi_h$,
containing the camera centres, with the rotation axis $L_s$.



2.2 Image Fixed Entities 

The 3D fixed entities are sequence invariants since they are imaged at the same 
position in every view. Their images include: 

— The line $l_s$ which is the image of the rotation axis $L_s$. Since points on
$L_s$ are fixed under the motion, their images are also fixed under the motion.

— The line $l_h$ in which $\pi_h$ intersects each image plane. It is the
vanishing line of $\pi_h$ (and indeed of all planes parallel to $\pi_h$).

— The point $x_s$ which is the image of the fixed point $X_s$.

— The point $v$ which is the vanishing point of the rotation axis.

These sequence invariants are illustrated schematically in Figure 3a, and the two
fixed lines are illustrated in Figure 4 on a real sequence.




Fig. 3. (a) Fixed image entities over the sequence, and their relation to the
columns $h_i$ of $H$. (b) Two-view entities. The entities which can be determined
from $F$ and their relation to the columns of $H$. The symmetric part of $F$ is a
degenerate conic consisting of the two lines $l_s$ and $l_h$. The anti-symmetric
part is represented by the point $x_a$. Points $x_s$ and $x_a$ have fixed
position over all view pairs. The position of the epipoles depends on the angle
of rotation $\Delta\theta_i$ between views.



2.3 Camera Matrices 

We have the freedom to choose the world coordinate system so that the rotation
axis is aligned with the world Z axis, and the first camera centre is at position $t$
on the X axis. Thus the first camera may be written

$$P_0 = H[I \mid \mathbf{t}]$$

where H is a homography representing the camera internal parameters and rotation
about the camera centre, and $\mathbf{t} = (t, 0, 0)^\top$. A rotation of the camera by $\theta$
about the Z axis is achieved by post-multiplying $P_0$ by

$$\begin{bmatrix} R_z(\theta) & 0 \\ 0^\top & 1 \end{bmatrix}$$

yielding the camera $P_\theta = H[R_z(\theta) \mid \mathbf{t}]$. In detail, with $h_i$ the columns of H:

$$P_\theta = [h_1\ h_2\ h_3]\,[R_z(\theta) \mid \mathbf{t}] \tag{1}$$
Fig. 4. Fixed image lines. The fixed image lines are shown overlaid on images
from the head sequence. The almost vertical line is $l_s$, the horizontal line is
$l_h$ (see also Figure 3). The eyebrow of the mannequin, which is approximately
coplanar with $\pi_h$, remains tangent to $l_h$ as the object rotates. These fixed lines
are automatically computed from the images using the algorithm of Section 3.



This division of the internal and external parameters means that H and $\mathbf{t}$ are
fixed over the sequence; only the angle of rotation, $\theta_i$, about the Z axis varies
for each camera $P_i$. Given this parametrization, the estimation problem can now
be precisely stated: we seek the common matrix H and the angles $\theta_i$ in order to
estimate the set of cameras $P_i$ for the sequence. Thus a total of $8 + m$ parameters
must be estimated for $m$ views, where 8 is the number of degrees of freedom
of the homography H. Note that the magnitude of translation only determines the
overall scaling and need not be considered further, as we are interested only in a
similarity reconstruction. The relative angle between views $i$ and $i + 1$ is denoted
$\Delta\theta_i$.

We now relate the columns of H to the fixed image entities:

– $x_s$ is the image of $X_s = (0, 0, 0, 1)^\top$, so under any $P_\theta$, $x_s = H(t, 0, 0)^\top = t\,h_1$.

– $v$ is the image of the direction of the world Z axis, $(0, 0, 1, 0)^\top$, giving $v = h_3$.

– $l_s = h_1 \times h_3$.

– $l_h = h_1 \times h_2$.

These relations are shown in Figure 3. To see that $l_s$, the image of the Z axis, is
given by $l_s = h_1 \times h_3$, consider a general point on the Z axis, $(0, 0, u, v)^\top$. Its projection
by any $P_\theta$ is $H(tv, 0, u)^\top = tv\,h_1 + u\,h_3$, a point on the line through $h_1$ and $h_3$.

Similar consideration of a general point on $\pi_h$ leads to $l_h = h_1 \times h_2$.

The columns of H are the vanishing points of an orthogonal triad of directions.
This triad rotates with the camera such that these vanishing points are related
to the fixed entities. $h_2$ is the vanishing point of the direction orthogonal to
those corresponding to $h_1$ and $h_3$.
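The relations between the columns of H and the fixed image entities can be checked numerically. The following is a minimal sketch, assuming the parametrization $P_\theta = H[R_z(\theta) \mid \mathbf{t}]$ with the usual counter-clockwise convention for $R_z(\theta)$; the matrix `H`, the offset `t` and all function names are illustrative values chosen here, not taken from the paper.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def camera(H, theta, t):
    """Return P_theta = H [R_z(theta) | (t, 0, 0)^T] as a 3x4 nested list."""
    c, s = math.cos(theta), math.sin(theta)
    Rt = [[c, -s, 0.0, t],
          [s, c, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
    return [[sum(H[i][k] * Rt[k][j] for k in range(3)) for j in range(4)]
            for i in range(3)]

def project(P, X):
    """Project a homogeneous 4-vector X with the 3x4 camera P."""
    return [sum(P[i][j] * X[j] for j in range(4)) for i in range(3)]

# Hypothetical full-rank homography and axis offset (assumed values).
H = [[800.0, 20.0, 320.0],
     [0.0, 780.0, 240.0],
     [0.1, 0.05, 1.0]]
t = 2.0
h1, h2, h3 = [list(col) for col in zip(*H)]
l_s = cross(h1, h3)   # image of the rotation axis
l_h = cross(h1, h2)   # vanishing line of the horizontal planes

def max_incidence_residual(angles):
    """Largest violation of 'axis points lie on l_s' and
    'horizontal directions vanish on l_h' over the given angles."""
    worst = 0.0
    for theta in angles:
        P = camera(H, theta, t)
        axis_point = project(P, [0.0, 0.0, 3.0, 1.0])   # a point on the Z axis
        horiz_dir = project(P, [1.0, -2.0, 0.0, 0.0])   # a horizontal direction
        worst = max(worst, abs(dot(l_s, axis_point)), abs(dot(l_h, horiz_dir)))
    return worst
```

For any set of rotation angles the residual should vanish up to rounding, which is what makes $l_s$ and $l_h$ sequence invariants.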

The procedure from here on is to determine the columns of H from the multiple
view tensors (F, T). We first consider the reconstruction ambiguity, where it
will be seen that from the multiple view tensors (i.e. from image measurements
alone) H is not determined uniquely, but is restricted to a two-parameter family.
2.4 Reconstruction Ambiguity

It is well known that if nothing is known of the calibration of 2 or more
cameras, nor their relative placement, then the reconstruction of the scene and
cameras is determined only up to an arbitrary projective transformation of 3-space.
For if T is any $4 \times 4$ invertible matrix, representing a projective transformation
of $P^3$, then replacing points $X_j$ by $TX_j$ and cameras $P_i$ by $P_iT^{-1}$ does
not change the image points, since $x_{ij} = P_iX_j = P_iT^{-1}TX_j$.

In the case of single axis rotation we know that the cameras $P_i$ have the
restricted form $P_\theta = H[R_z(\theta) \mid \mathbf{t}]$ of (1), so we may ask the question: suppose
we determine a reconstruction with a set of cameras of the form (1); how are
these cameras related to the actual cameras?

To answer this question, consider the class of transformations T which
preserve the form (1). Suppose we have two reconstructions with sets of cameras
$P_i = H[R_z(\theta_i) \mid \mathbf{t}]$ and $P'_i = H'[R_z(\theta'_i) \mid \mathbf{t}']$ of the correct form. Then T is an
admissible transformation if the sets of cameras are related as:

$$P'_i = H'[R_z(\theta'_i) \mid \mathbf{t}'] = H[R_z(\theta_i) \mid \mathbf{t}]\,T \quad \forall i \tag{2}$$

over at least 3 views (i.e. $m \geq 3$). We require that both H and H' are full rank
$3 \times 3$ matrices independent of $\theta$ and $\theta'$, and $\mathbf{t} = (t, 0, 0)^\top$, $\mathbf{t}' = (t', 0, 0)^\top$. Since
we are not concerned with the Euclidean transformation part of the ambiguity, 
T may be written as 



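where U is an upper triangular matrix. It can be shown that (2) has a
solution provided $\theta'_i = \theta_i$ and

$$T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \alpha & 0 \\ 0 & 0 & \beta & 1 \end{bmatrix} \tag{3}$$

with $\alpha$ and $\beta$ arbitrary scalars.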

This shows: (i) from image measurements alone the camera matrices can
be recovered only up to a two-parameter ambiguity parametrized by $\alpha$ and $\beta$.
Note that the angle $\theta$ is not ambiguous; (ii) the actual cameras lie in this
two-parameter family, so the reconstruction is also related to the actual cameras
by (3); (iii) the matrix H is only determined up to this ambiguity. To see this,
note that

$$H[R_z(\theta) \mid \mathbf{t}]\,T = [h_1,\ h_2,\ \alpha h_3 + \beta t\,h_1]\,[R_z(\theta) \mid \mathbf{t}].$$

This means that the last column of H can only be determined to within a 2-parameter
ambiguity from image measurements alone. We will see this ambiguity
arising when computing H from F and T in the following sections, and return in
Section 2.7 to methods of resolving the ambiguity.
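The effect of an admissible transformation on H can be verified column by column. The derivation below is a sketch under the assumption that T has the form (3) and that $\mathbf{t} = t\,\mathbf{e}_1$; the symbols $\mathbf{r}_1, \mathbf{r}_2$ (the first two columns of $R_z(\theta)$), $\mathbf{e}_i$ (the standard basis of 3-space) and $M$ are introduced here for illustration only.

```latex
\begin{align*}
[R_z(\theta) \mid \mathbf{t}]
  &= [\,\mathbf{r}_1,\ \mathbf{r}_2,\ \mathbf{e}_3,\ t\,\mathbf{e}_1\,], \\
[R_z(\theta) \mid \mathbf{t}]\,T
  &= [\,\mathbf{r}_1,\ \mathbf{r}_2,\ \alpha\,\mathbf{e}_3 + \beta t\,\mathbf{e}_1,\ t\,\mathbf{e}_1\,] \\
  &= M\,[R_z(\theta) \mid \mathbf{t}],
  \qquad
  M = \begin{bmatrix} 1 & 0 & \beta t \\ 0 & 1 & 0 \\ 0 & 0 & \alpha \end{bmatrix}, \\
H\,[R_z(\theta) \mid \mathbf{t}]\,T
  &= (HM)\,[R_z(\theta) \mid \mathbf{t}],
  \qquad
  HM = [\,h_1,\ h_2,\ \alpha h_3 + \beta t\,h_1\,].
\end{align*}
```

The second equality holds because $\mathbf{r}_1$, $\mathbf{r}_2$ and $\mathbf{e}_1$ have zero third component, so $M$ leaves them unchanged and only alters $\mathbf{e}_3$. The first two columns of H and the angle $\theta$ are therefore untouched, and only the third column moves within a two-parameter family.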

2.5 Two- View Geometry 

The 2-view geometry of single axis rotation is identical to that of planar motion,
for which many of the following properties of F have been derived. In
the planar motion case, however, the axis $L_s$ varies between view pairs, i.e. it is
not fixed over the sequence.

The fundamental matrix may be parametrized in terms of the fixed image
lines and an image point $x_a$ as



where the 3-vectors $x_a$, $l_h$, $l_s$ are scaled to unit norm.

Once F is estimated from two-view correspondences, the points $x_a$ and
$x_s$ and the lines $l_s$ and $l_h$ are known. Their relation to H is shown in Figure 3, and
can also be read off from the expression for F in terms of H and $\Delta\theta$:



Taking account of the unknown scaling of the homogeneous 3-vectors, the
columns of H are determined from F (i.e. the 2-view geometry), to within the
3-parameter family

$$H = [h_1, h_2, h_3] = [x_s,\ \mu x_a,\ \nu x_s + \omega d] \tag{4}$$

parametrized by the as yet undetermined scalars $\mu$, $\nu$, and $\omega$, where $d$ is an
(arbitrary) point on $l_s$, which may be chosen as $d = l_s \times (0, 0, 1)^\top$. In detail the
columns are determined by the following procedure:

1. Extract $x_a$ from the antisymmetric part of F, $F - F^\top \sim [x_a]_\times$.

2. Extract epipoles $e$ and $e'$, and compute $l_h = e \times e'$.

3. Compute $l_s = (2\,l_h^\top l_h\,I - l_h l_h^\top)(F + F^\top)\,l_h$.

4. Compute $x_s = l_s \times l_h$.

5. Set H according to (4).

Although the ratio $k_i$, where $\mu$ equals $k_i$ times the tangent of a function of the
rotation angle, may be computed from F, the value of $\mu$ is unknown. This means
that $\Delta\theta_i$ cannot be computed from two views.

2.6 Three View Geometry

From three views, it can be shown that the trifocal tensor may be written as a
pencil of tensors, parametrized by $\mu$,

$$T = \mu\,\mathcal{K} + \mathcal{K}'$$
where the elements of the tensors $\mathcal{K}$ and $\mathcal{K}'$ are computed from the two-view
quantities $k_i$ and H. Thus, one three-view point correspondence allows $\mu$, and
hence the $\Delta\theta_i$, to be recovered uniquely. The only remaining ambiguity is in the
third column of H. As shown in Section 2.4, this ambiguity cannot be reduced
further by the single axis motion constraint alone.
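Steps 1 to 4 of the two-view procedure of Section 2.5 can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it assumes an exact, noise-free F whose antisymmetric part is $[x_a]_\times$ and whose symmetric part is proportional to $l_s l_h^\top + l_h l_s^\top$, with $x_a$ on $l_h$ so that F has rank 2. The helper `planar_motion_like_F` and its weight `c` are hypothetical stand-ins used only to build a test matrix.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def transpose(M):
    return [list(row) for row in zip(*M)]

def skew(a):
    """Matrix [a]_x such that [a]_x b = a x b."""
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def null_vector(M):
    """Right null vector of a rank-2 3x3 matrix: the largest of the
    pairwise cross products of its rows."""
    candidates = [cross(M[0], M[1]), cross(M[0], M[2]), cross(M[1], M[2])]
    return max(candidates, key=lambda v: dot(v, v))

def fixed_entities(F):
    """Return (x_a, l_h, l_s, x_s), each up to scale, from an exact F."""
    Ft = transpose(F)
    A = [[F[i][j] - Ft[i][j] for j in range(3)] for i in range(3)]  # 2 x antisymmetric part
    S = [[F[i][j] + Ft[i][j] for j in range(3)] for i in range(3)]  # 2 x symmetric part
    x_a = [A[2][1], A[0][2], A[1][0]]      # step 1: read x_a off [x_a]_x
    e = null_vector(F)                     # step 2: F e = 0
    e_prime = null_vector(Ft)              #         F^T e' = 0
    l_h = cross(e, e_prime)
    S_lh = [dot(row, l_h) for row in S]    # step 3: (F + F^T) l_h
    n = dot(l_h, l_h)
    l_s = [2.0 * n * S_lh[i] - l_h[i] * dot(l_h, S_lh) for i in range(3)]
    x_s = cross(l_s, l_h)                  # step 4
    return x_a, l_h, l_s, x_s

def planar_motion_like_F(x_a, l_s, l_h, c=1.0):
    """Hypothetical test matrix [x_a]_x + c (l_s l_h^T + l_h l_s^T)."""
    K = skew(x_a)
    return [[K[i][j] + c * (l_s[i] * l_h[j] + l_h[i] * l_s[j]) for j in range(3)]
            for i in range(3)]

# x_a = (2, 0, 1) lies on l_h = (0, 1, 0), so the resulting F has rank 2.
F = planar_motion_like_F([2.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
x_a, l_h, l_s, x_s = fixed_entities(F)
```

With real, noisy estimates the null vectors and the rank-2 symmetric part would normally be obtained by a singular value or eigenvalue decomposition rather than by exact cross products.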

2.7 Removing the Reconstruction Ambiguity 

The reconstruction ambiguity of (3) is the following: metric structure is
recovered in planes perpendicular to the axis of rotation; there is an unknown
1D projective transformation along the axis. The ambiguity may be written:

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \rightarrow \begin{pmatrix} X/(\beta' Z + 1) \\ Y/(\beta' Z + 1) \\ \alpha' Z/(\beta' Z + 1) \end{pmatrix}$$

Note that since metric structure is determined in planes perpendicular to the
axis, the angle of rotation between views is known. Figure 5 illustrates this
projective ambiguity.
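To make the ambiguity concrete, the mapping can be applied to a few points. This is a minimal sketch; the function name `apply_ambiguity` is hypothetical, and the arguments `alpha` and `beta` play the role of $\alpha'$ and $\beta'$ above.

```python
def apply_ambiguity(point, alpha, beta):
    """Map (X, Y, Z) to (X, Y, alpha * Z) / (beta * Z + 1)."""
    X, Y, Z = point
    w = beta * Z + 1.0
    return (X / w, Y / w, alpha * Z / w)

# Points in the plane Z = 0 are left unchanged.
on_base_plane = apply_ambiguity((3.0, 4.0, 0.0), alpha=2.0, beta=1.0)

# Points sharing the same Z are scaled by a common factor and stay coplanar,
# so shape within each horizontal plane is preserved up to scale.
p = apply_ambiguity((1.0, 2.0, 1.0), alpha=2.0, beta=1.0)
q = apply_ambiguity((-2.0, 0.5, 1.0), alpha=2.0, beta=1.0)
```

Here `p` evaluates to `(0.5, 1.0, 1.0)` and `q` to `(-1.0, 0.25, 1.0)`: both are scaled by the same factor 1/2 within their plane, while the spacing of planes along the axis is distorted projectively.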




Fig. 5. Projective ambiguity: With no information about the camera or scene, there
is a 1D projective ambiguity in the Z direction. Five models of the cup with different
choices for $h_3$.



To this point no information on the internal calibration of the camera, or on
the 3D shape of the object, has been used. Internal constraints are provided, for
example, by the fact that the image pixels have zero skew and known aspect
ratio. Often the zero-skew constraint is not useful in practice because it does
not resolve the ambiguity (3). For example, if the image plane is parallel to
the rotation axis then all members in the family of solutions for the calibration
matrix will already satisfy the zero-skew constraint, so it does not provide any
additional information.

It can be shown that specifying the aspect ratio places a quartic constraint
on the parameters $\alpha$, $\beta$.

The easiest method of resolution is to use a vanishing point in the scene
to identify the plane at infinity (we already have the vanishing line of $\pi_h$), for
example by identifying two or more parallel scene lines. This determines $h_3$ up
to scale (i.e. the ratio $\alpha : \beta$), and the only remaining ambiguity is then a relative
scaling of the Z and plane directions:

Given $h_3$ up to scale, the internal aspect ratio then determines $\alpha$ and $\beta$ uniquely
(up to sign). Alternatively, the aspect ratio of the object can be used to resolve
the ambiguity.



3 Estimation of Camera Matrices 

This section describes the implementation of the algebra developed in the previous
section. From a raw input sequence we wish to compute the P matrices and
3D point structure. We first summarize the algebraic procedure of the previous
section, with the estimation steps then described in more detail below.

3.1 Algorithm Summary

Robust point tracks are computed a priori using our general-motion trifocal
tensor based system.

1. For each pair of views, fit the planar-motion fundamental matrix of Section 2.5.

2. From one of the Fs determine H up to a 3-parameter ambiguity.¹

3. From each $T_i$ determine $\mu$ and the two angles $\Delta\theta_i$ and $\Delta\theta_{i+1}$.

4. Average $\mu$ over the sequence, and angles from overlapping triplets.

5. Bundle adjust, varying H, $\theta_i$, and 3D points X, to minimize reprojection error.
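Step 4 combines the angle estimates of consecutive triplets, which overlap by one inter-view angle. The sketch below assumes a hypothetical data layout in which the triplet of views $(i, i+1, i+2)$ contributes one estimate each of $\Delta\theta_i$ and $\Delta\theta_{i+1}$; the function name is illustrative.

```python
def average_overlapping_angles(triplet_angles):
    """triplet_angles[i] = (dtheta_i, dtheta_{i+1}) estimated from views
    i, i+1, i+2. Returns one averaged angle per consecutive view pair."""
    n_angles = len(triplet_angles) + 1
    sums = [0.0] * n_angles
    counts = [0] * n_angles
    for i, (first, second) in enumerate(triplet_angles):
        sums[i] += first
        counts[i] += 1
        sums[i + 1] += second
        counts[i + 1] += 1
    return [s / c for s, c in zip(sums, counts)]

# Two triplets over four views: the middle angle is seen twice and averaged.
angles = average_overlapping_angles([(10.0, 12.0), (14.0, 9.0)])
```

In this example `angles` is `[10.0, 13.0, 9.0]`; the first and last angles are each seen by only one triplet.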

3.2 Point Tracking 

This is achieved by tracking interest points (Harris corners) through the
sequence. Tracking is easily achieved by our current general-motion system,
based on the trifocal tensor. This functionality is used unchanged in the current
system, although some speed improvements would certainly accrue if this process
were also modified to make use of the specialized geometry. Example point tracks
and track lifetimes are shown in Figure 6. Typically, about 150 points are tracked
in each image triplet, with 2000-3000 points appearing through a sequence.

3.3 F Estimation 

The fundamental matrix is estimated by first fitting a general-motion F to the
points. Then the symmetric part of F is truncated to rank 2, and decomposed
to recover $l_s$ and the epipoles. This provides a starting point for the special

¹ In the special case where the between-view angles $\Delta\theta_i$ are known to be identical, F is
estimated from all 2-view correspondences (typically thousands). Then H is extracted
from this F. Similarly T is fitted to all triplets.

Fig. 6. (a) Point tracks: Some point tracks from the dinosaur sequence. For clarity,
only the 200 tracks which survived for longer than 7 successive views are shown. In
total, 3070 points were tracked for 3 or more views. (b) Track lifetimes for dinosaur
sequence: Each horizontal bar corresponds to a single point track, extending from the
first to last frame in which the point was seen. The measurement matrix is relatively
sparse, and few points survived longer than 15 frames.



parametrized form of Section 2.5, which is fitted by minimizing the distance of points to
epipolar lines. The average number of point matches per view pair varied from
137 for the Head sequence to 399 for the Dinosaur. The average distance from
points to epipolar lines is about 0.3 pixels.



3.4 T Estimation 

The trifocal tensor is used only to determine $\mu$ from three views. From the
special-form fundamental matrices for the views, the two tensors $\mathcal{K}$ and $\mathcal{K}'$ are
computed. Then single point correspondences provide candidates for $\mu$. The
median of the candidates yields the estimate of $\mu$ for the triplet.

3.5 Bundle Adjustment 

The two and three view geometry provides an (excellent) initial estimate for the
camera matrices. In order to determine the maximum likelihood estimate, we
assume that errors in the positions of the 2D points are normally distributed.
An optimal estimate is then obtained by nonlinear minimization of the distances
between the reprojected 3D points and the 2D corners. Typical results for
geometry estimation are shown in Figure 7. These results are of comparable
quality with those obtained where the camera matrices were determined using a
calibration pattern. Convergence is generally achieved in 8 iterations, reducing
the RMS reprojection error from 0.3 pixels to 0.1 pixels. For 2000 points, compute
time per iteration is of the order of 10 seconds on a 300MHz UltraSparc. The
radius of convergence is large: the correct minimum is achieved from initial
estimates where the $\theta_i$ are in error by up to a factor of 2, although of course
many more iterations (about 100) are required.
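The quantity minimized in bundle adjustment can be written down directly from the camera parametrization. The following sketch evaluates the RMS reprojection error for cameras $P_i = H[R_z(\theta_i) \mid \mathbf{t}]$; it shows only the cost, not the nonlinear minimizer, and the observation layout (a dictionary keyed by view and point index) and all numeric values are assumptions made for illustration.

```python
import math

def camera(H, theta, t):
    """P_theta = H [R_z(theta) | (t, 0, 0)^T] as a 3x4 nested list."""
    c, s = math.cos(theta), math.sin(theta)
    Rt = [[c, -s, 0.0, t], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    return [[sum(H[i][k] * Rt[k][j] for k in range(3)) for j in range(4)]
            for i in range(3)]

def project_pixel(P, X):
    """Project homogeneous X and dehomogenize to pixel coordinates."""
    x = [sum(P[i][j] * X[j] for j in range(4)) for i in range(3)]
    return (x[0] / x[2], x[1] / x[2])

def rms_reprojection_error(H, t, thetas, points, observations):
    """observations maps (view index, point index) -> observed (u, v)."""
    total = 0.0
    for (i, j), (u, v) in observations.items():
        ur, vr = project_pixel(camera(H, thetas[i], t), points[j] + [1.0])
        total += (u - ur) ** 2 + (v - vr) ** 2
    return math.sqrt(total / len(observations))

# Hypothetical setup: two views, two points, noise-free observations.
H = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
t, thetas = 2.0, [0.0, 0.2]
points = [[0.5, -0.3, 5.0], [-1.0, 0.4, 6.0]]
exact = {(i, j): project_pixel(camera(H, thetas[i], t), points[j] + [1.0])
         for i in range(2) for j in range(2)}
error_exact = rms_reprojection_error(H, t, thetas, points, exact)

# Displacing one of the four observations by (3, 4) pixels gives RMS 5 / 2.
noisy = dict(exact)
u0, v0 = noisy[(0, 0)]
noisy[(0, 0)] = (u0 + 3.0, v0 + 4.0)
error_noisy = rms_reprojection_error(H, t, thetas, points, noisy)
```

A minimizer such as Levenberg-Marquardt would vary H, the angles and the 3D points to reduce this value.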




Fig. 7. Geometry estimation. The graphs show the recovered angles between successive
views for each of three sequences. (a) Object rotated by a mechanical turntable
with a resolution of 1 millidegree. The RMS difference between the angle recovered by
our algorithm and the nominal value is 40 millidegrees. This demonstrates the accuracy
of the angle recovery. (b) (c) Turn-table rotated by hand. The angle increment
is irregular and unknown a priori. Variation is up to 20° due to missing and repeated
views. (d) 3D points for dinosaur sequence.



4 Space Carving and Surface Rendering 

The object is computed as the intersection of the outline cones back-projected 
from all views. The outline in each image is determined by blue-screening. The 
surface of the object is determined very efficiently by an octree based algorithm. 

Octree Growing The octree is initialised as a cube bounding the object, and
is recursively subdivided to determine the surface. Each cube has one of three
labels depending on whether it lies entirely inside, lies entirely outside, or
partially intersects the surface. The former two cases are not of interest, and the
nodes are not subdivided. The subdividing is stopped at a preset depth. The
label of a cube is determined by successively projecting it into each image in the
sequence. An example of the octree “surface” developing is shown in Figure 8.
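The recursive subdivision can be sketched as follows. In the method described above the label of a cube comes from projecting it into every silhouette image; to keep the sketch self-contained, a hypothetical analytic classifier for a unit sphere stands in for that silhouette test. All names are illustrative.

```python
def classify_sphere(cube, radius=1.0):
    """Stand-in for the silhouette test: label an axis-aligned cube
    (cx, cy, cz, half_size) against a sphere centred at the origin."""
    cx, cy, cz, h = cube
    near2 = sum(max(abs(c) - h, 0.0) ** 2 for c in (cx, cy, cz))  # closest point
    far2 = sum((abs(c) + h) ** 2 for c in (cx, cy, cz))           # farthest corner
    if far2 <= radius ** 2:
        return "inside"
    if near2 > radius ** 2:
        return "outside"
    return "boundary"

def grow_octree(cube, classify, max_depth, depth=0, leaves=None):
    """Subdivide only cubes that intersect the surface, down to max_depth.
    Returns a list of (cube, label) leaves."""
    if leaves is None:
        leaves = []
    label = classify(cube)
    if label != "boundary" or depth == max_depth:
        leaves.append((cube, label))
        return leaves
    cx, cy, cz, h = cube
    q = h / 2.0
    for dx in (-q, q):
        for dy in (-q, q):
            for dz in (-q, q):
                grow_octree((cx + dx, cy + dy, cz + dz, q),
                            classify, max_depth, depth + 1, leaves)
    return leaves

# Bounding cube [-1, 1]^3 around the unit sphere, two levels of subdivision.
leaves = grow_octree((0.0, 0.0, 0.0, 1.0), classify_sphere, max_depth=2)
surface_cubes = [cube for cube, label in leaves if label == "boundary"]
```

Only the boundary leaves are passed on to surface extraction; interior and exterior cubes are never refined, which is what keeps the octree efficient.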










Fig. 8. Octree generation: the dinosaur octree is grown from a single bounding box.
The images above show the octree after (left to right) 3, 5, 7 and 8 subdivisions, given
36 images of the dinosaur.



Surface Generation The standard marching cubes algorithm provides an
initial consistent surface which is then smoothed using a localised surface decimation
algorithm. Examples are shown in Figures 9 through 11.




Fig. 9. Texture-mapped dinosaur model. From 36 input images, a $256^3$ resolution
volumetric model was generated containing 34752 triangles.

Fig. 10. (a) Top view of reconstructed cup points (no points were detected on the 
handle). RMS difference from a fitted cylinder is 0.004 of the diameter, (b) Texture- 
mapped cup model, (c) Shaded Freiburg model. The visual hull effect is apparent here, 
with too few views to penetrate to the object surface. 



Fig. 11. Closeup: High-resolution model of dinosaur hand, showing the fine detail
recoverable using volume intersection. Head: Texture-mapped model. The shades of
grey indicate the view from which each texture was taken; for each triangle, the view
in which it has largest visible area is chosen as the texture source.



5 Conclusions 

This paper has demonstrated that uncalibrated structure recovery systems based 
on the single-axis motion constraint can produce models of equivalent quality 
to fully calibrated systems, making a-priori calibration and expensive control of 
the viewing environment unnecessary. 

Also of interest are the results of volume intersection as a means of producing
fully 3D models of arbitrary topologies. Although the “visual hull” effect
might be expected to severely limit the range of models that can be acquired, the
dinosaur and cup experiments (see especially Figure 11) show that surprisingly
complex models can be acquired. However, it is on the model acquisition phase
that most plans for future work are centred: given the excellent camera geometry,
more advanced techniques can be applied. In particular, correlation of
the surface texture is expected to allow true super-resolution texture mapping,
and simultaneously to get “inside” the visual hull.
Abstract. Structure from motion algorithms typically do not use external
geometric constraints, e.g., the coplanarity of certain points or known
orientations associated with such planes, until a final post-processing
stage. In this paper, we show how such geometric constraints can be
incorporated early on in the reconstruction process, thereby improving
the quality of the estimates. The approaches we study include hallucinating
extra point matches in planar regions, computing fundamental
matrices directly from homographies, and applying coplanarity and other
geometric constraints as part of the final bundle adjustment stage. Our
experimental results indicate that the quality of the reconstruction can
be significantly improved by the judicious use of geometric constraints.



1 Introduction 

Structure from (image) motion algorithms attempt to simultaneously recover 
the 3D structure of a scene or object and the positions and orientations of the 
cameras used to photograph the scene. Algorithms for recovering structure and 
motion have many applications, such as the construction of 3D environments and 
pose localization for robot navigation and grasping, the automatic construction 
of 3D CAD models from photographs, and the creation of large photorealistic 
virtual environments. 

Structure from motion is closely related to photogrammetry, where the 3D 
location of certain key control points is usually known, thereby allowing the 
recovery of camera pose prior to the estimation of shape through triangulation 
techniques. In structure from motion, however, very few constraints are usually 
placed (or assumed) on the geometric structure of the scene being analyzed. 
This has encouraged the development of mathematically elegant and general 
formulations and algorithms that can be applied in the absence of any prior 
knowledge. 

In practice, however, structure from motion is often applied to scenes which
contain strong geometric regularities. The man-made world is full of planar
structures such as floors, walls, and tabletops, many of which have known
orientations (e.g., horizontal or vertical) or known relationships (e.g., parallelism and
perpendicularity). Even the natural world tends to have certain regularities, such
as the generally vertical direction of tree growth, or the existence of relatively
flat ground planes. A quick survey of many recent structure from motion papers
indicates that the test data sets include some very strong regularities (mostly
horizontal and vertical planes and lines) which are never exploited, except
perhaps for a final global shape correction.

In this paper, we argue that using external geometric knowledge can never decrease
the quality of a reconstruction so long as this knowledge is applied in a
statistically valid way. Rather than developing a single algorithm or methodology,
we examine a number of different plausible ways to bring geometric constraints
to bear, and then evaluate these empirically. In this way, we hope to
elucidate where geometric constraints can be used effectively. Our experiments
demonstrate that hallucinating additional correspondences in areas of known
planar motion, and applying higher order constraints such as perpendicularities
between planes, can lead to significantly better reconstruction.

After a brief review of related literature in Section 2, we present the basic
imaging equations, develop the relationships between point positions in two
views, and show how this reduces to a homography for the case of coplanar points
(Section 3). In Section 4 we preview the main approaches we will use to
solve the structure from motion problem when subsets of points are known to lie
on planes: augmenting planes with additional sample points before computing
the fundamental matrix (Section 5); using homographies to directly compute
the fundamental matrix; using plane plus parallax techniques; and performing
global optimization (bundle adjustment). We then discuss how additional
knowledge about the planes (e.g., perpendicularity constraints) can be used to
improve the solution, and present our experimental setup. We close with a
discussion of the results, and a list of potential extensions to our framework,
including the important case of line data.

2 Previous Work 

There has been a large amount of work on recovery of structure and motion from 
image sequences (A good introductory text book on the subject is 0)). However, 
relatively little work has been done on incorporating prior geometric knowledge 
(e.g., the coplanarity of points, or known feature orientations) directly into the 
reconstruction process. 

There has been some work in exploiting the motion of one or more planes for
recovering structure and motion. Luong and Faugeras show how to directly
compute a fundamental matrix from two or more of the homographies induced
by the motions of planes within the image. This technique, however, is very noise
sensitive. Plane plus parallax techniques directly exploit a known dominant planar
motion to compute the epipole(s) and perform a projective reconstruction.
However, none of these approaches incorporate the geometric constraints
of coplanarity in a statistically optimal fashion.
3 General Problem Formulation

Structure from motion can be formulated as the recovery of a set of 3-D structure
parameters $\{\mathbf{x}_i = (X_i, Y_i, Z_i)\}$ and time-varying motion parameters $\{(R_k, \mathbf{t}_k)\}$
from a set of observed image features $\{\mathbf{u}_{ik} = (u_{ik}, v_{ik}, 1)\}$. In this section, we
present the forward equations, i.e., the rigid body and perspective transformations
which map 3-D points into 2-D image points. We also derive the homography
(planar perspective transform) which relates two views of a planar point set.

To project the $i$th 3-D point into the $k$th frame at location $\mathbf{u}_{ik}$, we write

$$\mathbf{u}_{ik} \sim V_k R_k (\mathbf{x}_i - \mathbf{t}_k) \tag{1}$$

where $\sim$ indicates equality up to a scale, $R_k$ is the rotation matrix for camera $k$,
$\mathbf{t}_k$ is the location of its optical center, and $V_k$ is its projection matrix (usually
assumed to be upper triangular or some simpler form, e.g., diagonal). In most
cases, we will assume that $R_0 = I$ and $\mathbf{t}_0 = 0$, i.e., the first camera is at the
world origin. The location of a 3D point corresponding to an observed image
feature is

$$\mathbf{x}_i = w_{ik} R_k^{-1} V_k^{-1} \mathbf{u}_{ik} + \mathbf{t}_k \tag{2}$$

where $w_{ik}$ is an unknown scale factor.

It is useful to distinguish three cases, depending on the form of $V_k$. If $V_k$ is
known, we have the calibrated image case. If $V_k$ is unknown and general (upper
triangular), we have the uncalibrated image case, from which we can only recover
a projective reconstruction of the world.¹ If some information about $V_k$ is known
(e.g., that it is temporally invariant, or that it has a reduced form), we can apply
self-calibration techniques.

The motion of a point between two images $k$ and $l$ can thus be written as

$$\mathbf{u}_{ik} \sim V_k R_k (\mathbf{x}_i - \mathbf{t}_k) \sim V_k R_{kl} V_l^{-1} \mathbf{u}_{il} + w_{il}^{-1} \mathbf{e}_{kl} \tag{3}$$

with $R_{kl} = R_k R_l^{-1}$. The matrix $H^\infty_{kl} = V_k R_{kl} V_l^{-1}$ is the homography (planar
perspective transform) which maps points at infinity ($w_{il}^{-1} = 0$) from one image
to the next, while $\mathbf{e}_{kl} = V_k R_k (\mathbf{t}_l - \mathbf{t}_k)$ is the epipole, which is the vanishing
point of the residual parallax vectors once this planar perspective motion has
been subtracted (the epipole is also the image of camera $l$'s center in camera $k$'s
image, as can be seen by setting $w_{il} \rightarrow 0$).

When the cameras are uncalibrated, i.e., the $V_k$ can be arbitrary, the homography
$H^\infty_{kl}$ cannot be uniquely determined, i.e., we can add an arbitrary matrix
of the form $\mathbf{e}_{kl}\mathbf{v}^\top$ to $H^\infty_{kl}$ and subtract a plane equation $\mathbf{v}^\top\mathbf{u}_{il}$ from $w_{il}^{-1}$ and still
obtain the same result. More globally, the reconstructed 3D shape can only be
determined up to an overall 3D global perspective transformation (collineation).

The inter-image transfer equations have a simpler form when $\mathbf{x}$ is known to
lie on a plane $\mathbf{n}^\top\mathbf{x} - d = 0$. In this case, we can compute $w_{il}$ using

$$\mathbf{n}^\top\mathbf{x}_i - d = w_{il}\,\mathbf{n}^\top R_l^{-1} V_l^{-1} \mathbf{u}_{il} + \mathbf{n}^\top\mathbf{t}_l - d = 0,$$

or

$$w_{il}^{-1} = d_l^{-1}\,\mathbf{n}^\top R_l^{-1} V_l^{-1} \mathbf{u}_{il},$$

where $d_l = d - \mathbf{n}^\top\mathbf{t}_l$ is the distance of camera center $l$ ($\mathbf{t}_l$) to the plane $(\mathbf{n}, d)$.
Substituting $w_{il}^{-1}$ into (3), we obtain

$$\mathbf{u}_{ik} \sim (H^\infty_{kl} + d_l^{-1}\mathbf{e}_{kl}\mathbf{n}^\top R_l^{-1} V_l^{-1})\,\mathbf{u}_{il} \tag{4}$$

Letting $\mathbf{n}_l = V_l^{-\top} R_l \mathbf{n}$ be the plane normal in the $l$th camera's (scaled) coordinate
system, we see that the homography induced by the plane can be written
as

$$H_{kl} \sim H^\infty_{kl} + d_l^{-1}\mathbf{e}_{kl}\mathbf{n}_l^\top \tag{5}$$

i.e., it is very similar in form to the projective ambiguity which arises when
using uncalibrated cameras (this also forms the basis of the plane plus parallax
techniques discussed below).
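The plane-induced homography can be checked numerically against the projection equation (1). The sketch below builds $H_{kl}$ from hypothetical camera parameters (diagonal calibration matrices and rotations about the vertical axis, both chosen here for simplicity) and transfers the image of a point on the plane from view $l$ to view $k$; every numeric value and function name is an assumption for illustration.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def calibration(f):
    """Hypothetical diagonal projection matrix V = diag(f, f, 1)."""
    return [[f, 0.0, 0.0], [0.0, f, 0.0], [0.0, 0.0, 1.0]]

def calibration_inverse(f):
    return [[1.0 / f, 0.0, 0.0], [0.0, 1.0 / f, 0.0], [0.0, 0.0, 1.0]]

def project(V, R, t, x):
    """u = V R (x - t), kept unnormalized so that the scale w equals 1."""
    return matvec(matmul(V, R), [x[i] - t[i] for i in range(3)])

def plane_homography(Vk, Rk, tk, Vl_inv, Rl, tl, n, d):
    """H_kl = H_inf + e_kl n^T R_l^{-1} V_l^{-1} / d_l for the plane n.x = d."""
    back = matmul(transpose(Rl), Vl_inv)              # R_l^{-1} V_l^{-1}
    VkRk = matmul(Vk, Rk)
    H_inf = matmul(VkRk, back)
    e_kl = matvec(VkRk, [tl[i] - tk[i] for i in range(3)])
    d_l = d - sum(n[i] * tl[i] for i in range(3))
    n_row = [sum(n[i] * back[i][j] for i in range(3)) for j in range(3)]
    return [[H_inf[r][c] + e_kl[r] * n_row[c] / d_l for c in range(3)]
            for r in range(3)]

# A point on the plane Z = 5 seen by two hypothetical cameras.
n, d = [0.0, 0.0, 1.0], 5.0
x = [1.0, -2.0, 5.0]
Vl, Rl, tl = calibration(500.0), rot_y(0.1), [0.2, 0.0, 0.0]
Vk, Rk, tk = calibration(450.0), rot_y(-0.2), [-0.3, 0.1, 0.0]
u_l = project(Vl, Rl, tl, x)
u_k = project(Vk, Rk, tk, x)
H_kl = plane_homography(Vk, Rk, tk, calibration_inverse(500.0), Rl, tl, n, d)
transferred = matvec(H_kl, u_l)
```

Because the projections are left unnormalized, `transferred` should agree with `u_k` component by component (up to rounding), not merely up to scale.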

4 Structure from Motion with Planes 

In the remainder of this paper, we develop a number of techniques for recovering 
the structure and motion of a collection of points seen with 2 or more cameras. 
In addition to being given the estimated position of each point in two or more 
images, we also assume that some of the points are coplanar. We may also be 
given one or more image regions where the inter-frame homographies are known, 
but no explicit correspondences have been given. 

Given this information, there are several ways we could proceed. 

1. We can, of course, solve the problem ignoring our knowledge of coplanarity. 
This will serve as our reference algorithm against which we will compare all 
others. 

2. We can hallucinate (additional) point matches based on the homographies 
which are either given directly or which can be computed between collections 
of coplanar points. 

3. We can re-compute the 2D point locations so that the estimated or computed 
homographies are exactly satisfied. 

4. We can use the homographies induced by the planes in the image to estimate 
the fundamental matrix, and thence structure. 

5. We can use plane + parallax techniques to recover the camera geometry, and
after that the projective 3D structure.

6. We can perform a global optimization (bundle adjustment), using the knowl- 
edge about coplanarity as additional constraints to be added to the solution. 

To illustrate these algorithms, we initially use two simple data sets (Figure 1):

1. a collection of n points lying in a fronto-parallel plane with m points lying 
on a closer fronto-parallel plane; 
Fig. 1. Experimental datasets (front and top views): Front view shows the
location of the points projected into image 1 (black symbol) and image 2 (grey
symbol). Top view shows the relative 3D disposition of the points in orthographic
projection from above. (a) n = 5 points lying on a plane with m = 4 points lying
in front; (b) n = 4 points on each face of a trihedral vertex. For our experiments,
we use m = 6 and rotate the data around the vertical axis through 10°.



2. a trihedral vertex with n points on each of the three faces, with two of the 
points on each face being located along the common edge. 

Although we could, we will not use data sets where homographies are directly
given. Instead, we compute whatever homographies we need from the (noisy)
2D point measurements, and use these as inputs. A more detailed explanation
of our data and methodology is given in the section on our experimental setup.

5 Fundamental Matrices from Point Correspondences 

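Referring back to the basic two-frame transfer equation (3), we can pre-multiply
both sides by $[\mathbf{e}_{kl}]_\times$, where $[\mathbf{v}]_\times$ is the matrix form of the cross-product operator
with vector $\mathbf{v}$, to obtain

$$[\mathbf{e}_{kl}]_\times \mathbf{u}_{ik} \sim [\mathbf{e}_{kl}]_\times V_k R_{kl} V_l^{-1} \mathbf{u}_{il}$$

(since $[\mathbf{e}_{kl}]_\times$ annihilates the $\mathbf{e}_{kl}$ vector on the right hand side). Pre-multiplying
this by $\mathbf{u}_{ik}^\top$, we observe that the left-hand side is 0 since the cross product matrix
is skew symmetric, and hence

$$\mathbf{u}_{ik}^\top F_{kl}\,\mathbf{u}_{il} = 0, \tag{6}$$

where

$$F_{kl} \sim [\mathbf{e}_{kl}]_\times V_k R_{kl} V_l^{-1} = [\mathbf{e}_{kl}]_\times H^\infty_{kl} \tag{7}$$

is called the fundamental matrix. The fundamental matrix is of rank 2, since
that is the rank of $[\mathbf{e}_{kl}]_\times$, and has seven degrees of freedom (the scale of F is
arbitrary).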
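The construction of the fundamental matrix from the epipole and the infinite homography can be illustrated with synthetic cameras. This sketch forms $F_{kl} = [\mathbf{e}_{kl}]_\times V_k R_{kl} V_l^{-1}$ and evaluates the epipolar constraint for a few general (non-coplanar) points; the camera parameters and helper names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def skew(v):
    """Cross-product matrix [v]_x."""
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(V, R, t, x):
    return matvec(matmul(V, R), [x[i] - t[i] for i in range(3)])

def fundamental(Vk, Rk, tk, Vl_inv, Rl, tl):
    """F_kl = [e_kl]_x H_inf with e_kl = V_k R_k (t_l - t_k)."""
    VkRk = matmul(Vk, Rk)
    H_inf = matmul(VkRk, matmul(transpose(Rl), Vl_inv))
    e_kl = matvec(VkRk, [tl[i] - tk[i] for i in range(3)])
    return matmul(skew(e_kl), H_inf)

def epipolar_residual(F, u_k, u_l):
    """|u_k^T F u_l| normalized by the vector lengths involved."""
    Fu = matvec(F, u_l)
    num = abs(sum(a * b for a, b in zip(u_k, Fu)))
    den = math.sqrt(sum(a * a for a in u_k)) * math.sqrt(sum(b * b for b in Fu))
    return num / den

# Hypothetical cameras with diagonal calibration matrices.
Vl = [[500.0, 0.0, 0.0], [0.0, 500.0, 0.0], [0.0, 0.0, 1.0]]
Vl_inv = [[1.0 / 500.0, 0.0, 0.0], [0.0, 1.0 / 500.0, 0.0], [0.0, 0.0, 1.0]]
Vk = [[450.0, 0.0, 0.0], [0.0, 450.0, 0.0], [0.0, 0.0, 1.0]]
Rl, tl = rot_y(0.1), [0.2, 0.0, 0.0]
Rk, tk = rot_y(-0.2), [-0.3, 0.1, 0.0]
F = fundamental(Vk, Rk, tk, Vl_inv, Rl, tl)

points = [[1.0, -2.0, 5.0], [0.5, 1.0, 8.0], [-2.0, 0.3, 6.0]]
residuals = [epipolar_residual(F, project(Vk, Rk, tk, x), project(Vl, Rl, tl, x))
             for x in points]
```

All residuals should be zero up to rounding, whether or not the points are coplanar.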

When the camera calibration is known, we can premultiply screen coordinates
by $V^{-1}$ (i.e., convert screen coordinates into Euclidean directions), and obtain
the simpler essential matrix, E, which has two identical non-zero
singular values, and hence 5 degrees of freedom (the fundamental matrix has 7).

The fundamental matrix or essential matrix approach to two-frame structure
from motion is one of the most widely used techniques for structure from motion,
and some recent modifications have made this technique quite reliable in practice.
The essential matrix method was first developed for the calibrated image
case. This method was then generalized to the fundamental matrix approach,
which can be used with uncalibrated cameras.

Once the fundamental (or essential) matrix has been computed, we can estimate
$\mathbf{e}_{kl}$ and then compute the desired homography $H_{kl} \sim [\mathbf{e}_{kl}]_\times F_{kl}$.
The 3D location of each point can then be obtained by triangulation, and in our
experiments, this can be compared to the known ground truth.

As mentioned earlier, when we know that certain points are coplanar, we 
can use this information in one of two ways: (1) hallucinate (additional) point 
matches based on the homographies; or (2) re-compute the 2D point locations 
so that the estimated or computed homographies are exactly satisfied. 

5.1 Hallucinating Additional Correspondences 

The first approach proves to be useful in data-poor situations, e.g., when we
only have four points on a plane, and two points off the plane. By hallucinating
additional correspondences, we can generate enough data (say, two additional
points on the plane) to use a regular 8-point algorithm. If it helps for data-poor
situations, why not for other situations as well (say, eight points grouped onto
two planes)? Eventually, of course, the new data must be redundant, but at what
point? Methods which exploit homographies directly indicate
that there are six independent constraints available from a single homography.
Is this so when the data is noisy?
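One simple way to hallucinate matches is to sample new image points inside the region spanned by the known coplanar points and map them through the plane's homography. The sketch below does this with random convex combinations; the sampling strategy, the function names and the example homography are assumptions made for illustration and are not claimed to be the procedure used in the experiments.

```python
import random

def apply_homography(H, pt):
    """Map a 2D point through a 3x3 homography and dehomogenize."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def hallucinate_matches(plane_points, H, p, seed=0):
    """Create p extra correspondences for a plane whose inter-frame
    homography H is known. New points are random convex combinations
    of the observed coplanar points in the first image."""
    rng = random.Random(seed)
    matches = []
    for _ in range(p):
        weights = [rng.random() for _ in plane_points]
        total = sum(weights)
        x = sum(w * pt[0] for w, pt in zip(weights, plane_points)) / total
        y = sum(w * pt[1] for w, pt in zip(weights, plane_points)) / total
        matches.append(((x, y), apply_homography(H, (x, y))))
    return matches

# Four coplanar points and a hypothetical homography (scale 2, shift (1, -1)).
plane_points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
H = [[2.0, 0.0, 1.0], [0.0, 2.0, -1.0], [0.0, 0.0, 1.0]]
extra = hallucinate_matches(plane_points, H, p=2)
```

The hallucinated pairs can then be appended to the measured correspondences before running an 8-point algorithm. Since they carry no information beyond H itself, they add constraints only up to the degrees of freedom of the homography.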

Let’s get a feel for how much additional points help by running some exper- 
iments. Table 0 shows the results of adding p hallucinated points per plane to 
both of our test data sets (bi-plane and trihedral) Q and then running an 8-point 
algorithm to reconstruct the data jSESj. For the initially underconstrained data 
sets (n = 4,m = 2 bi-plane and n = 4 trihedral), and even for the minimally 
constrained data sets (n = 4, m = 4 bi-plane), adding enough hallucinated points 
to get more than the minimum required 8 provides a dramatic improvement in 
the quality of the results. On the other hand, adding hallucinated points to data 
set which already have more than 8 points only gives a minor improvement. 
This suggests that having more than the minimal number of sample points is 
more important than fully exploiting all of the constraints available from our 
homographies. 
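The hallucination step can be sketched in a few lines of Python/NumPy. This is only an illustrative sketch: the text does not say how the extra sample points are chosen, so the function name and the choice of random convex combinations of the measured points (which keeps the samples inside the region the homography was estimated from) are assumptions.

```python
import numpy as np

def hallucinate_correspondences(H, pts0, p, rng=None):
    """Create p extra matches (u, u') on one plane from its homography H.

    pts0: (n, 2) measured points on the plane in the first image.
    The sampling scheme (random convex combinations of pts0) is an
    assumption made for this sketch, not taken from the paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.dirichlet(np.ones(len(pts0)), size=p)   # (p, n) convex weights
    u = w @ pts0                                     # (p, 2) sample points
    uh = np.hstack([u, np.ones((p, 1))])             # homogeneous coordinates
    vh = uh @ H.T                                    # u' ~ H u
    return u, vh[:, :2] / vh[:, 2:3]
```

The returned pairs can simply be appended to the measured matches before calling any standard 8-point routine.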



^ The n = 6, m = 2 and n = 4, m = 2 data sets actually only have a single plane for
which a homography can be computed.

data set        n  m  p   N   method         Euclidean  affine   co-planarity
plane + 2pts    4  2  2   8   “8 pt” F       0.0651     0.0130   0.0023
   "            4  2  0   6   plane + ∥ax    0.0651     0.0130   0.0024
plane + 2pts    6  2  0   8   “8 pt” F       0.1879     0.0430   0.0149
   "            6  2  1   9      "           0.1482     0.0285   0.0158
   "            6  2  0   8   plane + ∥ax    0.1185     0.0184   0.0105
2 ∥ planes      4  4  0   8   “8 pt” F       0.1382     0.0335   0.0141
   "            4  4  1  10      "           0.0858     0.0235   0.0128
   "            4  4  2  12      "           0.0702     0.0200   0.0100
   "            4  4  0   8   plane + ∥ax    0.1709     0.0395   0.0077
2 ∥ planes      5  5  0  10   “8 pt” F       0.0538     0.0226   0.0144
   "            5  5  0  10   reproject      0.0484     0.0189   0.0114
   "            5  5  0  10   H → F          0.5516     0.3698   0.0163
   "            5  5  0  10   plane + ∥ax    0.0673     0.0189   0.0079
   "            5  5  0  10   bundle adj.    0.0467     0.0170   0.0092
   "            5  5  0  10   plane enf.     0.0392     0.0117   0.0000
   "            5  5  0  10   plane constr.  0.0384     0.0081   0.0000
6 ∥ planes      4  4  0  24   “8 pt” F       0.0761     0.0234   0.0074
   "            4  4  0  24   H → F          0.9459     0.7652   0.0088
   "            4  4  0  24   plane + ∥ax    0.8145     0.5312   0.0078
tilted cube     4  4  1  10   “8 pt” F       0.1549     0.0307   0.0091
   "            4  4  2  13      "           0.1301     0.0265   0.0076
   "            4  4  0   7   H → F          0.1383     0.0237   0.0079
   "            4  4  0   7   plane + ∥ax    0.2070     0.0411   0.0087
tilted cube     5  5  0  10   “8 pt” F       0.1460     0.0295   0.0110
   "            5  5  1  13      "           0.1263     0.0256   0.0111
   "            5  5  0  10   H → F          0.1014     0.0213   0.0093
   "            5  5  0  10   plane + ∥ax    0.1657     0.0348   0.0107

† Randomized data point placement

Table 1. Reconstruction error for various methods of structure estimation. n
and m are defined in Figure 1, p is the number of extra hallucinated points,
and N is the total number of points. The Euclidean and affine reconstruction
errors are for calibrated cameras. The coplanarity error measures the Euclidean
distance of points to their best-fit plane (calibrated reconstruction).



5.2 Reprojecting Points Based on Homographies 

A second approach to exploiting known coplanarity in the data set is to perturb 
the input 2D measurements such that they lie exactly on a homography. This 
seems like a plausible thing to do, e.g., projecting 3D points onto estimated 
planes is one way to “clean up” a 3D reconstruction. However, it is possible that 
this early application of domain knowledge may not be statistically optimal or 
even admissible. Let's explore this idea empirically.

The simplest way to perform this reprojection is to first compute
homographies between a plane in the kth frame and the 0th frame, and to then project
the points from the first frame into the kth frame using this homography. This is
equivalent to assuming that the points in the first frame are noise-free. Another
approach is to find 2D point locations such that they exactly satisfy the
homographies and minimize the projected errors. Since the latter involves a
complicated minimization, we have chosen to study the former, simpler idea. Methods
to incorporate coplanarity as a hard constraint on the solution will be presented
in Section 8.

Table 1 shows some results of reprojecting points in the second frame based on
the computed homographies (row reproject). A slight decrease in error is visible,
but this technique does not yield as dramatic improvements as hallucinating
additional correspondences.

6 Fundamental Matrices from Homographies 

Assuming that we are given (or can estimate) the inter-frame homographies
associated with two or more planes in the scene, there is a more direct method
for computing the fundamental matrix. Recall that the homography
associated with a plane $\hat{n}^T x - d = 0$ is $H_{kl} \sim \tilde{H}_{kl} + d^{-1} e_{kl} \tilde{n}^T$ and that the

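fundamental matrix associated with the same configuration has the form
$F_{kl} \sim [e_{kl}]_\times H_{kl}$. The product $H_{kl}^T F_{kl} \sim H_{kl}^T [e_{kl}]_\times H_{kl}$
is skew symmetric, and hence

$$S = H_{kl}^T F_{kl} + F_{kl}^T H_{kl} = 0. \qquad (8)$$

Writing out these equations in terms of the entries $h_{ij}$ and $f_{ij}$ of H and F (we'll
drop the kl frame subscripts) gives us

$$s_{ij} = \sum_k \left( h_{ki} f_{kj} + f_{ki} h_{kj} \right) = 0, \quad \forall (i,j). \qquad (9)$$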

k 

Each known plane homography H contributes six independent constraints on F,
since the matrix $H^T F + F^T H$ is symmetric, and hence only has six degrees of
freedom. Using two or more plane homographies, we can form enough equations
to obtain a linear least squares problem in the entries in F.
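The linear system implied by the symmetric constraint matrix can be assembled directly. The sketch below is an illustration under stated assumptions, not the original implementation: F is flattened row by row, the solution is taken as the smallest right singular vector, and the synthetic check assumes identity calibration with motion x' = Rx + t and planes n·x = d, so that H = R + t nᵀ/d and F = [t]× R. All function names are hypothetical.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0.0]])

def constraints_from_homography(H):
    """Six rows encoding s_ij = sum_k (h_ki f_kj + f_ki h_kj) = 0 for i <= j."""
    A = np.zeros((6, 9))
    row = 0
    for i in range(3):
        for j in range(i, 3):
            for k in range(3):
                A[row, 3 * k + j] += H[k, i]   # h_ki * f_kj
                A[row, 3 * k + i] += H[k, j]   # f_ki * h_kj
            row += 1
    return A

def fundamental_from_homographies(Hs):
    """Least squares F (unit norm, defined up to sign) from >= 2 homographies."""
    A = np.vstack([constraints_from_homography(H) for H in Hs])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)

# Noise-free synthetic check (assumed setup, identity calibration).
theta = 0.2
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.2, 0.1])
planes = [(np.array([0.0, 0.0, 1.0]), 5.0), (np.array([0.6, 0.0, 0.8]), 6.0)]
Hs = [R + np.outer(t, n) / d for n, d in planes]
F_true = skew(t) @ R
F_est = fundamental_from_homographies(Hs)
```

With exact homographies the recovered matrix matches the true one up to scale and sign; the stability problems discussed next only appear once the homographies are noisy.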

While this idea is quite simple, Luong and Faugeras report that the
technique is not very stable (it only yields improvements over a point-based
technique for two planes). Can we deduce why this method performs poorly?

Instead of using (9) to solve for F, what if we “hallucinate” point
correspondences based on the known homographies? Say we pick an image point
$u = (u_0, u_1, u_2)$ and project it to $u' = (u'_0, u'_1, u'_2) = Hu$, i.e., $u'_k = \sum_i h_{ki} u_i$.
The resulting constraint on F has the form

$$\sum_{kj} u'_k f_{kj} u_j = \sum_{ijk} f_{kj} h_{ki} u_i u_j = 0. \qquad (10)$$

By choosing appropriate values for $u$, we can obtain elements (or
combinations of the elements) of the symmetric S matrix in (8). For example,
when $u = \delta_i$, we get $\sum_k f_{ki} h_{ki} = \frac{1}{2} s_{ii}$. Thus, three of the constraints used
by Luong and Faugeras correspond to sampling the homography at points (1,0,0), (0,1,0), and
(0,0,1), two of which lie at infinity! Similarly, for $u = \delta_i + \delta_j$, $i \neq j$, we get
$\sum_k (f_{ki} h_{ki} + f_{ki} h_{kj} + f_{kj} h_{ki} + f_{kj} h_{kj}) = \frac{1}{2} s_{ii} + \frac{1}{2} s_{jj} + s_{ij}$. Thus, the remaining three
constraints used by Luong and Faugeras are linear combinations of constraints corresponding to
three sample points, e.g., (0,1,1), (0,1,0), and (0,0,1). Again, each constraint
uses at least one sample point at infinity! This explains why the technique does
not work so well. First, the homographies are sampled at locations where their
predictive power is very weak (homographies are most accurate at predicting the
correspondence within the area from which they were extracted). Second, the
resulting sample and projected points are far from having the kind of nice unit
distribution required for total least squares to work reasonably well.

To demonstrate the overall weakness of this approach, we show the
reconstruction error using the method of Luong and Faugeras in Table 1. From these results, we can
see that the approach is often significantly inferior to simply sampling the same
homography with sample points in the interior of the region from which it was
extracted. The six-plane data set (Figure 2) is representative of the kind of
data used by Luong and Faugeras, where they partitioned the image into regions and then
scattered coplanar points within each region. For the trihedral data set, however,
the homography-based method works quite well. To obtain comparable results
using the point hallucination method, quite a few additional sample points need
to be used. At the moment, we do not yet understand the discrepancy between
the fronto-parallel and trihedral data set results. A plausible conjecture is that
fronto-parallel data, whose “vanishing points” lie far away from the optical
center, are more poorly represented by an H matrix.



[Figure 2: two panels, “front (camera)” and “top (orthographic)”]

Fig. 2. Points clustered onto 6 fronto-parallel planes



7 Plane plus Parallax 

Another traditional approach to exploiting one or more homographies between
different views is to choose one homography as the dominant motion, and to
compute the residual parallax, which should point at the epipole. Such plane plus
parallax techniques [11,17] are usually used to recover a projective description
of the world, although some work has related the projective depth (magnitude
of the parallax) to Euclidean depth.

To compute the fundamental matrix, we choose one of the homographies,
say the first one, and use it to warp all points from one frame to the other. We
then compute the epipole by minimizing the sum of squared triple products,
$(x_i, x'_i, e)^2$, where $x_i$ and $x'_i$ are corresponding points (after transfer by homography).
Once $e$ has been determined, we can compute $F_{kl}$.

Table 1 shows the results of using the plane plus parallax to recover the 3D
structure for some of our data sets. The method works well when the points are
mostly on the plane used for the homography (n = 4 or 6, m = 2), but not
as well when the points are evenly distributed over several planes. This is not
surprising. Plane plus parallax privileges one plane over all others, forcing the
fundamental matrix to exactly match that homography. When the data is more
evenly distributed, a point-based algorithm (with hallucination, if necessary)
gives better results.
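The epipole estimate described above reduces to a small singular value problem, since the triple product $(x_i, x'_i, e)$ equals $(x_i \times x'_i) \cdot e$. The following Python/NumPy sketch assumes homogeneous 3-vectors for the points and a unit-norm constraint on $e$; the function names are hypothetical and the code is an illustration rather than the implementation used for Table 1.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0.0]])

def epipole_from_parallax(H, x, x_prime):
    """Unit epipole minimizing the sum of squared triple products.

    x, x_prime: (N, 3) homogeneous matches; x is first warped by H so that
    both points of each pair live in the same image.
    """
    warped = x @ H.T                      # transfer by the dominant homography
    lines = np.cross(warped, x_prime)     # (x_i, x'_i, e) = lines_i . e
    _, _, Vt = np.linalg.svd(lines)
    return Vt[-1]                         # smallest right singular vector

def fundamental_from_plane_parallax(H, x, x_prime):
    """F ~ [e]_x H, which matches the chosen homography exactly."""
    return skew(epipole_from_parallax(H, x, x_prime)) @ H
```

Points lying on the dominant plane have zero parallax and contribute (near-)zero rows, which is one way to see why at least two well-separated off-plane points are needed.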



8 Global Optimization (Bundle Adjustment) 



The final technique we examine in this paper is the one traditionally used by
photogrammetrists, i.e., the simultaneous optimization of 3D point and camera
placements by minimizing the squared error between estimated and measured
image feature locations.

There are two general approaches to performing this optimization. The first
interleaves structure and motion estimation stages. This has the advantage
that each point (or frame) reconstruction problem is decoupled from the other
problems, thereby solving much smaller systems. The second approach
simultaneously optimizes for structure and motion. This usually requires fewer
iterations, because the couplings between the two sets of data are made explicit,
but requires the solution of larger systems. In this paper, we adopt the former
approach. To reconstruct a 3D point location, we minimize

$$\sum_k \left( u_{ik} - \frac{p_{k0}^T \tilde{x}_i}{p_{k2}^T \tilde{x}_i} \right)^2 + \left( v_{ik} - \frac{p_{k1}^T \tilde{x}_i}{p_{k2}^T \tilde{x}_i} \right)^2 \qquad (11)$$

where $p_{kr}$ are the three rows of the camera or projection matrix

$$P_k = V_k [R_k \mid -t_k]$$

and $\tilde{x}_i = [x_i \mid 1]$, i.e., the homogeneous representation of $x_i$. As pointed out
by [25], this is equivalent to solving the following overconstrained set of linear
equations,

$$D_{ik}^{-1} (p_{k0} - u_{ik} p_{k2})^T \tilde{x}_i = 0 \qquad (12)$$
$$D_{ik}^{-1} (p_{k1} - v_{ik} p_{k2})^T \tilde{x}_i = 0,$$

where the weights are given by $D_{ik} = p_{k2}^T \tilde{x}_i$ (these are set to $D_{ik} = 1$ in the
first iteration). Notice that since these equations are homogeneous in $\tilde{x}_i$, we
solve this system by looking for the rightmost singular vector of the system of
equations (12).

^ The triple product measures the distance of e from the line passing through $x_i$ and
$x'_i$, weighted by the length of this line segment.
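The reweighted linear triangulation of a single point can be sketched as follows. This is a minimal illustration assuming known 3×4 projection matrices and a fixed number of reweighting passes; the function name and the iteration count are assumptions, not values taken from the text.

```python
import numpy as np

def triangulate_point(Ps, uv, n_iter=5):
    """Reweighted linear triangulation of one 3D point.

    Ps: list of 3x4 projection matrices P_k.
    uv: (K, 2) array of measurements (u_ik, v_ik), one row per frame.
    """
    D = np.ones(len(Ps))                        # D_ik = 1 in the first iteration
    for _ in range(n_iter):
        rows = []
        for P, (u, v), Dk in zip(Ps, uv, D):
            rows.append((P[0] - u * P[2]) / Dk)  # first equation of the pair
            rows.append((P[1] - v * P[2]) / Dk)  # second equation of the pair
        _, _, Vt = np.linalg.svd(np.array(rows))
        xh = Vt[-1] / Vt[-1][3]                  # rightmost singular vector, scaled to [x | 1]
        D = np.array([P[2] @ xh for P in Ps])    # updated weights p_k2 . x
    return xh[:3]
```

Dividing each row by the current depth estimate makes the algebraic residual approximate the image-plane error that the bundle adjustment criterion actually measures.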

The same equations can be used to update our estimate of $P_k$, by simply
grouping equations with common $k$'s into separate systems. When reconstructing
a Euclidean ($V_k[R_k \mid -t_k]$) description of motion, the estimation equations
become more complicated. Here, applying a linearized least squares like the
Levenberg-Marquardt algorithm is more fruitful. Let us assume the following
updates

$$R_k \leftarrow R_k (I + [\omega_k]_\times), \quad t_k \leftarrow t_k + \delta t_k. \qquad (13)$$

We can compute the terms in

$$P_k + \delta P_k = V_k [R_k (I + [\omega_k]_\times) \mid -(t_k + \delta t_k)] \qquad (14)$$

as functions of $V_k, R_k, t_k$, i.e., the Jacobian of the twelve entries in $\delta P_k$ with
respect to $\omega_k$ and $\delta t_k$. We can then solve the system of equations

$$D_{ik}^{-1} (\delta p_{k0} - u_{ik} \delta p_{k2})^T \tilde{x}_i = u_{ik} - \hat{u}_{ik} \qquad (15)$$
$$D_{ik}^{-1} (\delta p_{k1} - v_{ik} \delta p_{k2})^T \tilde{x}_i = v_{ik} - \hat{v}_{ik},$$

substituting the $\delta p_{kr}$ with their expansions in the unknowns ($\omega_k$, $\delta t_k$). The
rotation and translation estimates can then be updated using (13), using Rodrigues'
formula for the rotation matrix

$$R \leftarrow R \left( I + \sin\theta \, [\hat{n}]_\times + (1 - \cos\theta) \, [\hat{n}]_\times^2 \right)$$

with $\theta = \|\omega\|$, $\hat{n} = \omega / \theta$. A similar approach can be used to update the focal
length, or other intrinsic calibration parameters, if desired.
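The rotation update via Rodrigues' formula can be sketched as below. The function names and the small-angle cutoff are assumptions made for this illustration; the point of using the finite formula instead of $I + [\omega]_\times$ is that the updated matrix stays exactly orthogonal.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0.0]])

def rodrigues_update(R, omega):
    """Return R (I + sin(theta) [n]_x + (1 - cos(theta)) [n]_x^2)."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:                 # negligible update: keep R unchanged
        return R
    N = skew(omega / theta)           # [n]_x for the unit rotation axis n
    return R @ (np.eye(3) + np.sin(theta) * N + (1 - np.cos(theta)) * N @ N)
```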

The above discussion has assumed that each point can be solved for
independently. What about points that are known to be coplanar? Here, we need to
incorporate constraints of the form $n_p^T x_i - d_p = 0, \{i \in \Pi_p\}$. Two approaches
come to mind. The first is to alternate a plane estimation stage with the point
reconstruction stage. The second is to simultaneously optimize the point positions
and plane equations. We describe the former, since it is simpler to implement.

Fitting planes to a collection of 3D points is a classic total least squares
problem. After subtracting the centroid of the points, $\bar{x}_p$, we compute the
singular value decomposition of the resulting deviations, and choose the
rightmost singular vector as the plane equation. We then set $d_p = n_p^T \bar{x}_p$.

^ The Levenberg-Marquardt algorithm [15] leads to a slightly different set of equations

$$D_{ik}^{-1} (p_{k0} - \hat{u}_{ik} p_{k2})^T \delta\tilde{x}_i = u_{ik} - \hat{u}_{ik},$$

where $\hat{u}_{ik}$ is the current estimate of $u_{ik}$ and $\delta x_i$ is the desired update to $x_i$. In
practice, the two methods perform about as well.

To enforce this hard constraint on the point reconstruction stage, we add the
equation $n_p^T x_i - d_p = 0$ to the system (12) as a linear constraint. Since points
may end up lying on several planes, we use the method of weighting approach to
constrained least squares [p. 586], i.e., we add the constraints $n_p^T x_i - d_p = 0$
to the set of equations for x^ with a large weight (currently 2^^ « 10^°). 
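The plane estimation stage described above (centroid subtraction followed by an SVD) can be sketched as follows. The function name is hypothetical; the code only illustrates the total least squares fit, not the weighted constraint system.

```python
import numpy as np

def fit_plane(points):
    """Total least squares plane n.x = d through an (N, 3) array of points."""
    centroid = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - centroid)
    n = Vt[-1]                 # rightmost singular vector: direction of least variance
    return n, n @ centroid     # unit normal n_p and offset d_p
```

The returned pair can then be appended, suitably weighted, as an extra equation for every point assigned to that plane.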

Table 1 shows the results of applying bundle adjustment to the initial
structure and motion estimates computed using an 8-point method. For fronto-parallel
planes, bundle adjustment significantly reduces the reconstruction error. For the
trihedral data set, it has little effect. Notice, however, the large discrepancy
between the Euclidean and affine reconstruction errors for the trihedral data. This
suggests that the major source of error is probably a bas-relief ambiguity,
which is not removable even with a statistically optimal technique such as bundle
adjustment. Enforcing coplanarity (“plane enf.” in Table 1) does not significantly
reduce the reconstruction error, although it is successful at reducing coplanarity
error to 0 (which may be desirable to make the data appear less “wobbly”).

9 Constraints on Planes 

In addition to grouping points onto planes, we can apply additional constraints 
on the geometry of the planes themselves. For example, if we know that two or 
more planes are parallel, then we can compute a single normal vector for all the 
“coplanar” points after their individual centroids have been subtracted. 

The line corresponding to method “plane constr.” in Table 1 shows the result
of applying a parallelism constraint to our fronto-parallel data set. The results
are not all that different from not using the constraint.

If we know that certain planes are perpendicular, this too can be enforced 
during the normal computation stage. If two or three planes are known to be 
mutually orthogonal, we can concatenate the normals into a matrix, compute 
its SVD, replace the singular values with 1, and reconstitute the matrix. 
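The orthogonalization of a set of plane normals can be written in a couple of lines. In this sketch the normals are assumed to be stored as the columns of a matrix, and the function name is hypothetical.

```python
import numpy as np

def orthogonalize_normals(N):
    """Closest matrix with orthonormal columns to N (3x2 or 3x3, normals as columns)."""
    U, _, Vt = np.linalg.svd(N, full_matrices=False)
    return U @ Vt              # same singular vectors, singular values replaced by 1
```

Because this is the nearest matrix with orthonormal columns in the Frobenius sense, the estimated normals are moved as little as possible while becoming exactly perpendicular.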

Applying this idea to the trihedral data set as part of the bundle adjustment
loop yields dramatically lower reconstruction errors (Table 1). Adding the
perpendicularity constraint removes most of the bas-relief ambiguity (uncertainty)
in the reconstruction, with the resulting reconstruction error being more closely
tied to the triangulation error.

Lastly, if planes have explicitly known orientations (e.g., full constraints in 
the case of ground planes, or partial constraints in the case of vertical walls), 
these too can be incorporated. However, a global rotation and translation of 
coordinates may first have to be applied to the current estimate before these 
constraints can be enforced. 
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10 Experiments 

We have performed an extensive set of experiments to validate our algorithms 
and to test the relative merits of various approaches. Our experimental software 
first generates a 3D dataset in one of three possible configurations: a set of 
fronto-parallel planes filling the field of view (Figure^), a set of fronto-parallel 
planes in non-overlapping regions of the image (Figure 0, or a trihedral corner 
(Figure On each of the planes, we generate from 1 to 9 sample points, in 
the configurations shown in Figure 0 

The 3D configuration of points is projected onto the camera's image plane.
For our current experiments, we rotate the data around the y axis in increments
of 10°, and place the camera 6 units away from the data (the data itself fills
a cube spanning $[-1, 1]^3$). We then generate 50 noise-corrupted versions of the
projected points (for the experiments described in this paper, σ = 0.2 pixels, on
a 200 × 200 image), and use these as inputs to the reconstruction algorithms.
The mean RMS reconstruction error across all 50 trials is then reported.

The reconstruction errors are computed after first finding the best 3D
mapping (Euclidean/similarity or affine) from the reconstructed data points onto the
known ground truth points. The columns labeled “Euclidean” and “affine” in
Table 1 measure the errors between data reconstructed using calibrated cameras
and the ground truth after finding the best similarity and affine mappings. The
coplanarity error is computed by finding the best 3D plane fit to each
coplanar set of reconstructed points (calibrated camera), and then measuring the
distances to the plane.

Table 1 is a representative sample from our more extensive set of experiments.
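Two of the error measures described above can be sketched compactly. The exact normalization of the RMS values is not stated in the text, so the per-point root mean square used here is an assumption, as are the function names; the similarity-aligned (“Euclidean”) error would additionally require a Procrustes-style fit and is not shown.

```python
import numpy as np

def affine_aligned_rms(recon, truth):
    """RMS point distance after the best affine map from recon onto truth."""
    A = np.hstack([recon, np.ones((len(recon), 1))])
    M, *_ = np.linalg.lstsq(A, truth, rcond=None)     # 4x3 affine map (least squares)
    resid = A @ M - truth
    return np.sqrt(np.mean(np.sum(resid ** 2, axis=1)))

def coplanarity_rms(points):
    """RMS distance of an (N, 3) point set to its best-fit plane."""
    s = np.linalg.svd(points - points.mean(axis=0), compute_uv=False)
    return s[-1] / np.sqrt(len(points))               # smallest singular value
```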



11 Discussion and Conclusions 

In this paper, we have presented a number of techniques for exploiting the ge- 
ometric knowledge typically available in structure from motion problems. In 
particular, we have focused on how to take advantage of known coplanarities 
in the data. Our techniques also enable us to directly exploit homographies be- 
tween different regions of the image, when these are known. Of the techniques 
tried, hallucinating additional correspondences is simple to implement, and often 
yields a significant improvement in the results, especially in situations which are 
initially data-poor. Reprojecting the data to exactly fit the homography does not 
appear to significantly improve the results. Using homographies to directly esti- 
mate the fundamental matrix sometimes works, but also often fails dramatically; 
using hallucinated correspondences seems like a more prudent approach. 

Bundle adjustment improves the results obtained with the 8-point algorithm, 
but often not by that much. Adding coplanarity as a hard constraint does not 
seem to make a significant difference in the accuracy of the reconstruction, al- 
though it does make the reconstruction look smoother. Adding parallelism as a 
geometric constraint does not seem to improve the results that much. On the 
other hand, adding perpendicularity constraints for the trihedral data set leads 
to a dramatic decrease in reconstruction error (most likely due to a reduction 
in the bas-relief ambiguity). As mentioned above, plane plus parallax works well 
when the points are mostly on the plane used for estimating the homography, 
but not as well when the points are evenly distributed over several planes. 

These results suggest that adding hallucinated correspondences to planar 
grouping of points (or hallucinating correspondences in regions with known ho- 
mographies) is a useful and powerful idea which improves structure from motion 
results with very little additional complexity. Similarly, geometric constraints 
(coplanarity, parallelism, and perpendicularity) can be added to the bundle ad- 
justment stage with relatively little effort, and can provide significantly improved 
results. 



11.1 Future Work 

This paper has concentrated on the geometric constraints available from know- 
ing that certain points are coplanar. Similar constraints are available for points 
which are known to be collinear. The situation, however, is often a little differ- 
ent: line matching algorithms often do not localize the endpoints of lines in each 
image, so there may be no initial points in correspondence, nor is it possible 
to hallucinate such correspondences prior to an actual reconstruction. However,
exploiting known orientations for lines (e.g., vertical and horizontal), and
geometric constraints between their orientations (parallelism and perpendicularity)
is indeed possible, and can lead to algorithms which reconstruct a 3D scene from
a single view.

In terms of points on planes, our current results could be extended in a
number of directions. First, we have not yet explored the use of multi-frame algebraic
approaches such as trilinear tensors. Second, we have not explored
multi-frame bundle adjustment techniques, nor have we explored the use of robust
estimation techniques [23]. Hallucinating correspondences should be equally
applicable to all three of these approaches. We would also like to better understand
the differences in results obtained from fronto-parallel and oblique planes, and
in general to anticipate the expected accuracy of results for various geometric
configurations and camera motions.
Abstract. The paper deals with the structure-motion problem for
uncalibrated cameras, in the case that subsidiary information is available,
consisting e.g. in known coplanarities or parallelities among points in the
scene, or known positions of some focal points (hand-eye calibration).
Despite unknown camera calibrations, it is shown that in many instances
the subsidiary information makes affine or even Euclidean reconstruction
possible. A parametrization by affine shape and depth is used, providing
a simple framework for the incorporation of a priori knowledge, and
enabling the development of iterative, rapidly converging algorithms. Any
number of points in any number of images are used in a uniform way, with
equal priority, and independently of coordinate representations.
Moreover, occlusions are allowed.



1 Introduction 

The structure and motion problem is central for computer vision, dealing with
the analysis of a 3D scene by means of a sequence of 2D images. It is often
studied by epipolar geometry and multilinear constraints. The present paper
uses an alternative approach, based on the notions of affine shape and depth,
developed in a series of papers.

Depending on the a priori information available, the structure and motion
problem can be treated on different levels. In the case of uncalibrated cameras it
is well known that only projective reconstruction is possible. Working
with point configurations, we here consider the case when some affine or
Euclidean knowledge about the scene or the camera locations is available, e.g. a
number of occurrences like 'two lines are parallel' or 'a line is parallel to a plane',
in which case affine reconstruction can be achieved. When having in addition
some sort of Euclidean information, like a city map in the case of pictures of
a city scene, this may be strengthened to Euclidean reconstruction. Another
situation considered is when the relative placement of at least five focal points
is known, where it is shown that not only the projective but also the affine
structure of the scene can be recovered. Again, having Euclidean information
about the focal points, this can be strengthened to Euclidean reconstruction of
the scene. The latter situation appears naturally in hand-eye calibration from
pictures taken by a camera mounted on a moving robot arm, with registration
of the motion parameters. It is also shown how to adjust a reconstruction to be
consistent with coplanarity constraints for some set of points.

* The work has been supported by the ESPRIT reactive LTR project 21914, CUMULI,
and by the Swedish Council for Engineering Sciences (TFR), project 95-64-222.

The notion of affine shape is well suited to handle these situations, theoreti- 
cally as well as computationally. To gain robustness, numerical computations are 
based on a variational formulation, possible to exploit by linear algebraic meth- 
ods. Data from any number of points in any number of images can be treated 
simultaneously, without preselection of reference points or images. In particular, 
there is no need to handle the numerically unstable situation of overdetermined 
systems of polynomial equations with uncertain coefficients. 

The plan of the paper is as follows. In Section 2 a brief recapitulation of the 
notions of affine depth and shape and their use in single view geometry is given. 
Section 3 deals with multiple view geometry along the same lines. With this
background, Section 4 presents algorithms for projective reconstruction. These
are then extended in Section 5 to affine and Euclidean reconstruction in the case 
of subsidiary information. More details and discussions about the notions of 
affine shape and depth are given in a self-contained and independent Appendix. 



2 Single View Geometry by Affine Shape and Depth

Let $A^3$ denote the three-dimensional affine space, let $\Pi'$ be a plane in $A^3$, the
image plane, and let $\Pi$ be a subset of $A^3$. By $P_\phi$ is meant the perspective
transformation $\Pi \to \Pi'$ with centre $\phi$. Here $\phi$ is allowed to be a point at infinity,
in which case $P_\phi$ is a parallel projection in direction $\phi$. Perspective
transformations model the pinhole camera. If no metrical information is known or used, the
camera is said to be uncalibrated.

The principal objects dealt with in this paper are n-point configurations $\mathcal{X}$, by
which is meant ordered sets of points $\mathcal{X} = (X^1, \ldots, X^n)$, where $X^k \in A^3$, $k = 1, \ldots, n$.
Let $p_{\mathcal{X}}$ denote the dimension of $\mathcal{X}$, e.g. $p_{\mathcal{X}} = 3$ if $\mathcal{X}$ is a non-planar
3D-configuration. The set of n-point configurations of dimension $p$ will be denoted
$C_{n,p}$.

In a series of papers, the notions of affine shape space
and affine depth space have been developed. The definitions and main properties
are summarized below. For a somewhat more thorough presentation, see the
accompanying appendix.

The following notation is used throughout the paper: If $\alpha = (\alpha_1, \ldots, \alpha_n)$, $\xi = (\xi_1, \ldots, \xi_n)$,
let $\alpha\xi = (\alpha_1 \xi_1, \ldots, \alpha_n \xi_n)$, and let $\alpha^{-1} = (1/\alpha_1, \ldots, 1/\alpha_n)$. Moreover,

ietro = {^GK"| Eia = o}- 
— Definition of affine shape and depth spaces. Let $x^k$ be the coordinate column
vector of $X^k$ with respect to an arbitrary affine basis, $k = 1, \ldots, n$. Then
the affine shape space and the affine depth space are defined by

$$s(\mathcal{X}) = \mathcal{N} \begin{bmatrix} x^1 & \cdots & x^n \\ 1 & \cdots & 1 \end{bmatrix} \quad \text{and} \quad d(\mathcal{X}) = \mathcal{R}_{row} \begin{bmatrix} x^1 & \cdots & x^n \\ 1 & \cdots & 1 \end{bmatrix},$$

respectively, where $\mathcal{N}$ stands for nullspace, and $\mathcal{R}_{row}$ for rowspace. (Cf.
the Appendix.)

— Affine invariancy. There exists an affine transformation $A : \mathcal{X} \to \mathcal{X}'$ if and
only if $s(\mathcal{X}) \subset s(\mathcal{X}')$, or, equivalently, $d(\mathcal{X}') \subset d(\mathcal{X})$. If $\mathcal{X}$ is restricted to
planar configurations, then the inclusions are in fact equalities, $s(\mathcal{X}) = s(\mathcal{X}')$
and $d(\mathcal{X}') = d(\mathcal{X})$, respectively. In this case we also write $\mathcal{X} = \mathcal{X}'$. (Cf.
the Appendix.)

— Dimension. The dimensions of the linear spaces $s(\mathcal{X})$ and $d(\mathcal{X})$ are related to
the dimension of the configuration by $\dim s(\mathcal{X}) = n - p_{\mathcal{X}} - 1$, $\dim d(\mathcal{X}) = p_{\mathcal{X}} + 1$,
respectively. (Cf. Theorem A.2.)

— Shape and depth theorem. There exists a perspective transformation $P$ such
that $P(\mathcal{X}) = \mathcal{Y}$ with depth $\alpha$ if and only if $\alpha s(\mathcal{X}) \subset s(\mathcal{Y})$, or, equivalently,
$\alpha d(\mathcal{Y}) \subset d(\mathcal{X})$. If $\mathcal{X}$ is restricted to planar configurations, then the
inclusions are replaced by $\alpha s(\mathcal{X}) = s(\mathcal{Y})$ and $\alpha d(\mathcal{Y}) = d(\mathcal{X})$, respectively. (Cf.
the Appendix.)

— Definition of S- and D-matrices. By an S-matrix of $\mathcal{X}$ is meant a matrix
having $s(\mathcal{X})$ as column space. By a D-matrix is meant a matrix having
$d(\mathcal{X})$ as row space. Passage between different S-matrix representations is
performed by multiplication from the right by a non-singular matrix. (Cf.
the Appendix.)

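— Focal point theorem. If $\alpha s(\mathcal{X}) \subset s(\mathcal{Y})$, then there exists a projection
$P_\phi : \mathcal{X} \to \mathcal{Y}$ if and only if

$$\phi = \sum_{k=1}^{n} \alpha_k^{-1} \eta_k X^k \Big/ \sum_{k=1}^{n} \alpha_k^{-1} \eta_k,$$

where $\eta \in s(\mathcal{Y}) \setminus \alpha s(\mathcal{X})$. The compound configuration $(\mathcal{X}, \phi)$ thus has an
S-matrix

$$S_{(\mathcal{X},\phi)} = \begin{bmatrix} \mathrm{diag}(\alpha^{-1}) \, S_{\mathcal{Y}} \\ -(\alpha^{-1})^T S_{\mathcal{Y}} \end{bmatrix}.$$

(Cf. Theorem B.2.)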
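The definitions above translate directly into a nullspace computation. The sketch below is an illustration only: the function name, the SVD-based rank tolerance, and the storage of a configuration as a (dimension × n) coordinate array are assumptions. It returns an S-matrix whose columns span the shape space, so the dimension formula and the inclusion in the shape and depth theorem can be checked numerically for a synthetic pinhole projection.

```python
import numpy as np

def shape_matrix(X):
    """S-matrix of a configuration given as a (dim, n) coordinate array.

    The columns of the result span the nullspace of the coordinate matrix
    augmented with a row of ones, i.e. the affine shape space s(X).
    """
    M = np.vstack([X, np.ones(X.shape[1])])
    _, s, Vt = np.linalg.svd(M)
    rank = int(np.sum(s > 1e-10 * s[0]))   # numerical rank = p_X + 1
    return Vt[rank:].T                      # n x (n - p_X - 1)
```

For instance, projecting seven generic 3D points with the camera $[I \mid 0]$ gives depths equal to the z-coordinates, and multiplying the rows of the 3D S-matrix by these depths yields vectors that lie in the shape space of the image configuration.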



3 Multiple Views 

3.1 Main Theorem 

Suppose that $\mathcal{Y}^1, \ldots, \mathcal{Y}^m \in C_{n,2}$ are projective images of one and the same
configuration $\mathcal{X} \in C_{n,3}$. The shape and depth theorem implies that
$\alpha^1 s(\mathcal{X}) \subset s(\mathcal{Y}^1), \ldots, \alpha^m s(\mathcal{X}) \subset s(\mathcal{Y}^m)$, or, equivalently,

$$s(\mathcal{X}) \subset (\alpha^1)^{-1} s(\mathcal{Y}^1), \ldots, s(\mathcal{X}) \subset (\alpha^m)^{-1} s(\mathcal{Y}^m), \quad \text{where } \alpha^1, \ldots, \alpha^m \in d(\mathcal{X}).$$

From this the equivalence between the first two items in the following theorem
follows. An analogous argument for the depth space yields the equivalence with
the third item.

Theorem 1. Structure theorem. Let $\mathcal{X} \in C_{n,p}$. The following statements are
equivalent:

- $\mathcal{Y}^i \in C_{n,p-1}$ is a projective image of $\mathcal{X}$ with depth vector $\alpha^i$, $i = 1, \ldots, m$,
  where not all projections are flat,
- $s(\mathcal{X}) = (\alpha^1)^{-1} s(\mathcal{Y}^1) \cap \ldots \cap (\alpha^m)^{-1} s(\mathcal{Y}^m)$,
- $d(\mathcal{X}) = \alpha^1 d(\mathcal{Y}^1) + \ldots + \alpha^m d(\mathcal{Y}^m)$.

First note the ambiguity in the last two items, consisting in that they remain
valid after multiplication with any $\beta$ in $d(\mathcal{X})$, giving rise to a new consistent
reconstruction, with shape space $\beta s(\mathcal{X})$. This is the shape-depth formulation of
the well-known projective reconstruction ambiguity. Also note that since
$\dim d(\mathcal{X}) = 4$, the ambiguity is governed by four independent components of $\beta$.

To indicate the usage of the theorem, consider the equivalence of the first two
items in the case m = 2. First normalize by multiplying by $\beta = \alpha^1$, then put
$q = \alpha^1 / \alpha^2$, and let $\mathcal{X}_\parallel$ denote the corresponding reconstructed object configuration.
Here $q$ is called kinetic depth, and the notation $\mathcal{X}_\parallel$ comes from the fact that $\mathcal{Y}^1$
is formed by a parallel projection. The condition to fulfill is

$$s(\mathcal{X}_\parallel) = s(\mathcal{Y}^1) \cap q \, s(\mathcal{Y}^2). \qquad (1)$$

To analyze this condition, choose S-matrices $S_{\mathcal{Y}^1}$, $S_{\mathcal{Y}^2}$, and form the
compound matrix

$$W_q(\mathcal{Y}^1, \mathcal{Y}^2) = [\, S_{\mathcal{Y}^1} \mid \mathrm{diag}(q) S_{\mathcal{Y}^2} \,].$$

A dimension argument yields that a necessary condition for (1) to be fulfilled is
that

$$\dim \mathcal{N} W_q(\mathcal{Y}^1, \mathcal{Y}^2) = n - p - 1 \iff \mathrm{rank}\, W_q = n - p + 1.$$

One way to proceed is by forming polynomial equations from the vanishing
of all subdeterminants of $W_q$ of order n − p. For reasons that will be discussed
later, we prefer another method, described in Section 4.1. However,
once $q$ is determined, $\mathcal{X}_\parallel$ can be computed as the intersection space in (1), after
which all other consistent reconstructions are obtained by multiplication with
$\beta \in d(\mathcal{X}_\parallel)$. As remarked above, this gives a four parameter family of solutions
to the reconstruction problem.



3.2 The Chasles Matrix 

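Next we combine Theorem 1 with the focal point theorem, to describe the
interplay between structure and motion. By means of the S-matrices of the respective
images and the depth vectors $\alpha^1, \ldots, \alpha^m$, a compound matrix, called the Chasles
matrix, is formed

$$C(Y^1, \ldots, Y^m; \bar\alpha^1, \ldots, \bar\alpha^m) =
\begin{bmatrix}
\operatorname{diag}(\bar\alpha^1) S_{Y^1} & \cdots & \operatorname{diag}(\bar\alpha^m) S_{Y^m} \\
-(\bar\alpha^1)^T S_{Y^1} & & 0 \\
 & \ddots & \\
0 & & -(\bar\alpha^m)^T S_{Y^m}
\end{bmatrix}, \qquad (2)$$

where $\bar\alpha = 1/\alpha$ componentwise, as in the appendix.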



Theorem 2. Structure and motion theorem. Let $X \in C_{n,p}$. The following
statements are equivalent.

— $Y^i \in C_{n,p-1}$ is a projective image of $X$ with depth vector $\alpha^i$, $i = 1, \ldots, m$,
where not all projections are flat,

— the Chasles matrix $C(Y^1, \ldots, Y^m; \bar\alpha^1, \ldots, \bar\alpha^m)$ is an S-matrix of the
compound configuration $(X, \phi^1, \ldots, \phi^m)$.



From this theorem it follows that a necessary and sufficient condition for
geometric consistency is that the Chasles matrix has rank $m + n - 4$. In particular,
this means that when having fixed the locations of four of the $m + n$ points
$X_1, \ldots, X_n, \phi^1, \ldots, \phi^m$, all the others are known too, as linear expressions in
the four selected points. These expressions can be read out explicitly from the
Chasles matrix, as illustrated by the following example.



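Example 1. Let $Y^1$ and $Y^2$ be defined by their S-matrices

$$S_{Y^1} = \begin{bmatrix} 8 & 8 \\ -4 & -1 \\ -1 & -4 \\ -3 & 0 \\ 0 & -3 \end{bmatrix}, \qquad
S_{Y^2} = \begin{bmatrix} 1 & 5 \\ -1 & 2 \\ 1 & -4 \\ -1 & 0 \\ 0 & -3 \end{bmatrix}.$$

Then

$$W_q(Y^1, Y^2) = \begin{bmatrix}
8 & 8 & q_1 & 5q_1 \\
-4 & -1 & -q_2 & 2q_2 \\
-1 & -4 & q_3 & -4q_3 \\
-3 & 0 & -q_4 & 0 \\
0 & -3 & 0 & -3q_5
\end{bmatrix}.$$

One verifies that if $q = (3, 6, 9, 1, 2)$, then $\operatorname{rank} W_q = 3$. According to (2), a
Chasles matrix is obtained by enlarging $W_q$ with two rows, in such a way that
all column sums vanish:

$$C(Y^1, Y^2, 1, q) = \begin{bmatrix}
8 & 8 & 3 & 15 \\
-4 & -1 & -6 & 12 \\
-1 & -4 & 9 & -36 \\
-3 & 0 & -1 & 0 \\
0 & -3 & 0 & -6 \\
0 & 0 & 0 & 0 \\
0 & 0 & -5 & 15
\end{bmatrix}.$$

Here the first five rows correspond to points of $X$, while rows six and seven
correspond to $\phi^1$ and $\phi^2$, respectively.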

Another Chasles matrix is obtained by elimination of the $\phi^2$-component in
the fourth column:

$$C(Y^1, Y^2, 1, q) = \begin{bmatrix}
8 & 8 & 3 & 24 \\
-4 & -1 & -6 & -6 \\
-1 & -4 & 9 & -9 \\
-3 & 0 & -1 & -3 \\
0 & -3 & 0 & -6 \\
0 & 0 & 0 & 0 \\
0 & 0 & -5 & 0
\end{bmatrix}.$$

From the fourth and third columns of $C$ we read out that

$$s(X_\parallel) = \text{linear hull}\,(8, -2, -3, -1, -2) \quad \text{and} \quad \phi^2 = \tfrac{1}{5}(3X_1 - 6X_2 + 9X_3 - X_4).$$

Moreover, from the first column it follows that $P_{\phi^1}$ is a parallel projection in the
direction

$$8X_1 - 4X_2 - X_3 - 3X_4,$$

which determines the point at infinity $\phi^1$.

By this we have completely described one solution of the structure-motion
problem. All other solutions are generated by letting $\bar\alpha^1$ run through $d(X_\parallel)$, i.e.
the hyperplane $8\bar\alpha^1_1 - 2\bar\alpha^1_2 - 3\bar\alpha^1_3 - \bar\alpha^1_4 - 2\bar\alpha^1_5 = 0$, with four degrees of freedom.
One example of such an $\bar\alpha^1$ is $(3, 3, 2, 6, 3)$. Then $\bar\alpha^2 = \bar\alpha^1 q = (9, 18, 18, 6, 6) \parallel
(3, 6, 6, 2, 2)$, which gives the Chasles matrix

$$C(Y^1, Y^2, \bar\alpha^1, \bar\alpha^2) = \begin{bmatrix}
24 & 24 & 3 & 15 \\
-12 & -3 & -6 & 12 \\
-2 & -8 & 6 & -24 \\
-18 & 0 & -2 & 0 \\
0 & -9 & 0 & -6 \\
8 & -4 & 0 & 0 \\
0 & 0 & -1 & 3
\end{bmatrix}.$$

After elimination of a $\phi^1$-component of the first image, and a $\phi^2$-component of
the second, we obtain another Chasles matrix

$$C(Y^1, Y^2, \bar\alpha^1, \bar\alpha^2) = \begin{bmatrix}
24 & 72 & 3 & 24 \\
-12 & -18 & -6 & -6 \\
-2 & -18 & 6 & -6 \\
-18 & -18 & -2 & -6 \\
0 & -18 & 0 & -6 \\
8 & 0 & 0 & 0 \\
0 & 0 & -1 & 0
\end{bmatrix}.$$

Now all characteristics of the structure-motion problem, the shape of $X$ as well
as the focal points, can be read out:

$$s(X) = \text{linear hull}\,(4, -1, -1, -1, -1),$$

$$\phi^1 = -3X_1 + \tfrac{3}{2}X_2 + \tfrac{1}{4}X_3 + \tfrac{9}{4}X_4,$$

$$\phi^2 = 3X_1 - 6X_2 + 6X_3 - 2X_4.$$

Note that columns two and four are parallel, and that both describe the shape
of $X$. This is what could be expected from the fact that the object configuration
has not changed between the imaging instants. Also note that $\alpha^1 s(X) = s(X_\parallel)$,
in accordance with the discussion above.
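
The numbers in Example 1 can be checked mechanically. The following is only a minimal sketch in exact rational arithmetic; the helper names `rank` and `chasles` are illustrative and not taken from the paper, and the two-image Chasles matrix is built directly from the description above (scaled S-matrices side by side, plus one row per focal point making every column sum vanish).

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows), by exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def chasles(s1, s2, a1, a2):
    """Two-image Chasles matrix: row k of each S-matrix is scaled by a1[k]
    (resp. a2[k]); two extra rows make all column sums vanish."""
    top = [[a * v for v in r1] + [b * v for v in r2]
           for r1, r2, a, b in zip(s1, s2, a1, a2)]
    sums = [sum(col) for col in zip(*top)]
    w = len(s1[0])
    row_phi1 = [-s for s in sums[:w]] + [0] * (len(sums) - w)
    row_phi2 = [0] * w + [-s for s in sums[w:]]
    return top + [row_phi1, row_phi2]


# S-matrices and kinetic depth of Example 1.
S1 = [[8, 8], [-4, -1], [-1, -4], [-3, 0], [0, -3]]
S2 = [[1, 5], [-1, 2], [1, -4], [-1, 0], [0, -3]]
q = [3, 6, 9, 1, 2]

C = chasles(S1, S2, [1] * 5, q)          # first five rows are W_q
C2 = chasles(S1, S2, [3, 3, 2, 6, 3], [3, 6, 6, 2, 2])

print("rank W_q =", rank(C[:5]), " rank C =", rank(C))
print("focal point rows:", C[5], C[6])
print("focal point rows, second solution:", C2[5], C2[6])
```

With a kinetic depth that is not consistent with the two images (for instance all ones), the same `rank` helper reports full rank 4 for $W_q$, which is the rank test of Section 3.1 in its simplest form.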



3.3 Relation to Fundamental Matrices and Multilinear Forms 



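To fix the ideas, consider the case of 5-point configurations, and choose
S-matrices so that

$$W_q(Y^1, Y^2) = \begin{bmatrix}
\eta^1_{11} & \eta^1_{12} & q_1 \eta^2_{11} & q_1 \eta^2_{12} \\
\eta^1_{21} & \eta^1_{22} & q_2 \eta^2_{21} & q_2 \eta^2_{22} \\
\eta^1_{31} & \eta^1_{32} & q_3 \eta^2_{31} & q_3 \eta^2_{32} \\
-1 & 0 & -q_4 & 0 \\
0 & -1 & 0 & -q_5
\end{bmatrix}.$$

By the discussion after Theorem 2, the necessary and sufficient condition
for geometric consistency is $\operatorname{rank} W_q = 5 - 4 + 2 = 3$, or, equivalently, that all
$4 \times 4$-subdeterminants of $W_q$ vanish. Consider for instance the subdeterminant
obtained from the rows 1, 2, 3, 4. Put $\xi = [\eta^1_{12}\ \eta^1_{22}\ \eta^1_{32}]^T$,
$\zeta = [\eta^2_{12}\ \eta^2_{22}\ \eta^2_{32}]^T$. It is readily verified that the subdeterminant
condition can be written

$$\xi^T \Phi \zeta = 0 \quad \text{with} \quad
\Phi = \begin{bmatrix} 0 & B_1 & -B_2 \\ -B_1 & 0 & B_3 \\ B_2 & -B_3 & 0 \end{bmatrix}
\operatorname{diag}(q_1, q_2, q_3),$$

and

$$B_1 = \begin{vmatrix} \eta^1_{31} & q_3 \eta^2_{31} \\ \eta^1_{41} & q_4 \eta^2_{41} \end{vmatrix}, \quad
B_2 = \begin{vmatrix} \eta^1_{21} & q_2 \eta^2_{21} \\ \eta^1_{41} & q_4 \eta^2_{41} \end{vmatrix}, \quad
B_3 = \begin{vmatrix} \eta^1_{11} & q_1 \eta^2_{11} \\ \eta^1_{41} & q_4 \eta^2_{41} \end{vmatrix},$$

where $\eta^1_{41} = \eta^2_{41} = -1$ according to the fourth row of $W_q$.
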


This shows that $\xi^T \Phi \zeta = 0$ is a necessary condition for the points $\xi$ and $\zeta$ in
the respective images to match. This is the classical epipolar constraint, cf. [2],
and the matrix $\Phi$ is the fundamental matrix with respect to this particular choice
of frames. The factorization of $\Phi$ was discovered in [?], in a slightly different
setting, where it was called the reduced fundamental matrix. In an analogous
way, trilinear and multilinear forms appear by taking subdeterminants of $W_q$
when $m > 2$.
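
The relation between the $4 \times 4$ subdeterminant of rows 1 to 4 and the bilinear form can be checked numerically. The sketch below is an illustration only: it reuses the S-matrices of Example 1, with columns rescaled (an assumption made here, which does not change the column spaces) so that rows 4 and 5 take the normalized form used in this section, and the function names are hypothetical. With the sign conventions written out in the code, the subdeterminant equals $-\xi^T \Phi \zeta$, so both vanish together.

```python
from fractions import Fraction as F


def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, v in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * v * det(minor)
    return total


# Example 1 S-matrices, columns rescaled so rows 4, 5 read (-1, 0) and (0, -1).
E1 = [[F(8, 3), F(8, 3)], [F(-4, 3), F(-1, 3)], [F(-1, 3), F(-4, 3)],
      [F(-1), F(0)], [F(0), F(-1)]]
E2 = [[F(1), F(5, 3)], [F(-1), F(2, 3)], [F(1), F(-4, 3)],
      [F(-1), F(0)], [F(0), F(-1)]]


def subdet(q):
    """Subdeterminant of W_q built from rows 1..4 (0-based rows 0..3)."""
    w = [[E1[k][0], E1[k][1], q[k] * E2[k][0], q[k] * E2[k][1]]
         for k in range(4)]
    return det(w)


def epipolar(q):
    """The bilinear form xi^T Phi zeta, with Phi = K diag(q1, q2, q3)."""
    def b(k):
        # 2 x 2 determinant from the first-column entries of rows k and 4
        return E1[k][0] * q[3] * E2[3][0] - q[k] * E2[k][0] * E1[3][0]

    b1, b2, b3 = b(2), b(1), b(0)
    phi = [[0, b1 * q[1], -b2 * q[2]],
           [-b1 * q[0], 0, b3 * q[2]],
           [b2 * q[0], -b3 * q[1], 0]]
    xi = [E1[k][1] for k in range(3)]
    zeta = [E2[k][1] for k in range(3)]
    return sum(xi[i] * phi[i][j] * zeta[j]
               for i in range(3) for j in range(3))


for q in ([3, 6, 9, 1, 2], [1, 2, 3, 4, 5]):
    print(q, subdet(q), epipolar(q))
```

For the consistent kinetic depth of Example 1 both quantities are zero; for an arbitrary $q$ they are nonzero in general but still agree up to sign.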

In the case of exact data, the statement that $W_q$ has rank 3 is equivalent
to the vanishing of a number of appropriately chosen subdeterminants, some of
which can be interpreted as fundamental matrices. However, when such a finite
family of algebraic conditions is used in the presence of noise, there is no longer
any guarantee for the fulfillment of the rank condition. The same objection remains
in the case of multilinear forms, and constitutes a drawback of algorithms based
on fundamental matrices and multilinear forms. Another disadvantage is the
coordinate dependency, which may require rules for coordinate normalization.

All these problems are avoided by working with the matrix Wq and the 
Chasles matrix, where, loosely speaking, simultaneous and uniform averaging is 
done over all conceivable constraints. One is led to a viewpoint, where there is
nothing special about the epipolar constraint compared to the other constraints
that can be drawn from Theorem 2, except that, contrary to most of the others,
the epipolar constraint has a nice geometric interpretation. 

3.4 Proximity Measures 

Intending to work with linear algebraic methods instead of polynomial equations,
a quantitative tool for comparison of the correlation of linear subspaces
is needed. In fact, the shape and depth theorem (single view) and the structure
theorem (multiple views) both make assertions about the intersection of linear
subspaces. In the single view case, the condition is that the intersection space
of $\alpha s(X)$ and $s(Y)$ coincides with $\alpha s(X)$, and in the multiple view case, the
condition is that the $(n-3)$-dimensional subspaces $s(Y^1), \ldots, s(Y^m)$ intersect in
an $(n-4)$-dimensional subspace. This leads to the formulation of the

General problem: Measure the rate of $c$-dimensional coincidence between
linear subspaces $V_1, \ldots, V_m$ of $\mathbb{R}^n$.

To construct such a measure, let $P_V$ denote the orthogonal projection matrix
onto $V$. Then there is a chain of equivalences,

$$x \in V_1 \cap \ldots \cap V_m \iff \tfrac{1}{m}(P_{V_1} x + \ldots + P_{V_m} x) = x \iff$$

$$x \text{ eigenvector with eigenvalue 1 of } M = \tfrac{1}{m}(P_{V_1} + \ldots + P_{V_m}).$$

It follows that Vi , . . . , Vm intersect in a c-dimensional subspace if and only if the 
eigenspace corresponding to the eigenvalue 1 of M has dimension c. Hence the 
matrix I — M has rank deficiency c. A natural measure of this rank deficiency 
is the c:th smallest eigenvalue of I — M. An equivalent choice, more suitable for 
convergence studies of the algorithms below, is the following proximity measure: 

C 

. . . , Vm) = (^ where Ai < . . . < A„ are eigenvalues of / — M . 

In connection with single and multiple view geometry, as described by the
shape and depth theorem and the structure theorem, $V = s(Y)$ for some $Y$.
Taking an S-matrix for $Y$ with orthonormal columns, the projection matrix can
be written $S S^T$. Using these theorems, a complication is the unknown depth
parameters that appear. Violating slightly the orthogonality claim, in the case
of single views below we work with the matrix

$$M = \tfrac{1}{2}(A S_X S_X^T A + S_Y S_Y^T) \quad \text{with } A = \operatorname{diag} \alpha,$$

and in the case of multiple views, with the matrix

$$M = \tfrac{1}{m}\bigl(S_{Y^1} S_{Y^1}^T + Q_2 S_{Y^2} S_{Y^2}^T Q_2 + \ldots + Q_m S_{Y^m} S_{Y^m}^T Q_m\bigr),$$

with $Q_i = \operatorname{diag}(q^i)$, $i = 1, \ldots, m$. An analogous construction can be done with
depth spaces instead of shape spaces, cf. [?].
4 Algorithms for Projective Reconstruction

4.1 Complete Data, No Occlusions 

By the discussion above, the problem is to determine kinetic depth vectors $q$ so
that the $m$-image analogue of (1) is fulfilled for some $X$. This can be done by
the following algorithm, introduced in [?]. A dual version, using depth instead
of shape spaces, leads to factorization methods generalizing the one of [?], cf.
[?]. The algorithm reads:

1. take $q^i = 1$ for $i = 1, \ldots, m$,

2. compute an estimate of $X$ by means of multiple view proximity,

3. knowing an estimate of $X$, compute for each image $i$ an estimate of the
depth vector $\alpha^i$ by means of single view proximity, and form the
corresponding kinetic depth vector $q^i$,

4. go to 2 or STOP, according to some criterion.

It can be shown that the sequence formed by the successively computed
values of the proximity measure, $(\pi_k)_{k=1}^{\infty}$, decreases and converges to a local
minimum of $\pi$, considered as a function of $q^2, \ldots, q^m$. Also the successively
computed kinetic depth values and reconstruction estimates converge. In the
case of exact data, and sufficiently many images and points, there is a unique
minimum, corresponding to the true values of kinetic depth and the true object
configuration $X$. Empirically, the algorithm converges very rapidly, in 10-20
iterations.




Fig. 1. Left loop: algorithm of Section 4.1. Right loop: algorithm of Section 4.2.



The convergence can be proved by observing that the minimization problem
hidden in the proximity measure can be formulated

$$\inf_P \| M(q^2, \ldots, q^m) - P \|_F,$$

where $P$ runs through the orthogonal projection matrices of rank $n - 4$. This
minimization problem can be studied by classical analytical methods.
4.2 Missing Data, Occlusions

The algorithm above can easily be modified to handle also the situation of
missing data, i.e. when not all points are visible in all images. In this case
step 3 of the algorithm in Section 4.1 is divided into two:

3' knowing an estimate of $X$, compute by means of single view proximity for
each image $i$ a depth vector corresponding to points present in image $i$,

3'' compute for each image $i$ the depth vector corresponding to the missing
points, using that the total depth vector $\alpha^i \in d(X)$.

The algorithms have been tested on images of a London scene, provided
by Fraunhofer IGD within the CUMULI project. Six images have been used,
taken at different locations on the river bank of the Thames. Forty points were
manually detected in the images, where due to occlusions about 20 % of the data
was missing. The outcome is illustrated in Figure 2 and Figure 3, left diagram.

5 Using Subsidiary Information 

5.1 Coplanarities 

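In man-made scenes, one often knows a priori that certain points are coplanar.
The formalism of affine shape is well adapted to this situation. Suppose for
instance that the first four points are coplanar. Then $s(X)$ contains an element
where all components except the first four vanish (cf. (i) in Example A.1). Hence,
for a given S-matrix, there is a column vector $z$ such that

$$\begin{bmatrix} \times \\ \times \\ \times \\ \times \\ 0 \\ \vdots \\ 0 \end{bmatrix} =
\begin{bmatrix}
\times & \cdots & \times \\
\times & \cdots & \times \\
\times & \cdots & \times \\
\times & \cdots & \times \\
\times & \cdots & \times \\
\vdots & & \vdots \\
\times & \cdots & \times
\end{bmatrix} z.$$
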

In case of non-exact data, this can't be expected to be fulfilled exactly. However,
by a least squares argument, the S-matrix can easily be adjusted to fulfill
one or several coplanarity conditions. For instance, in the situation above, let $\xi$
be the element in $s(X)$ that is closest to the linear space $U$ consisting of vectors
with vanishing components $5, \ldots, n$, and let $\xi'$ be the projection of $\xi$ on $U$. Let
$S'$ be the orthogonal complement of $\xi$ in $s(X)$. Then $S' \oplus \xi'$ is the subspace
of $S_0$ that is closest to $s(X)$ in proximity measure $\pi$. It is shape space of some
configuration $X'$, obeying the coplanarity constraint. In the same way, multiple
coplanarity constraints can be handled.

This leads to an algorithm, illustrated by the left hand loop in Figure 4,
yielding projective reconstruction of an object fulfilling a family of coplanarity
constraints.
Fig. 2. Two of the six images used of a London scene. The symbols x denote
the point configuration used, consisting of 40 points, with lots of occlusions. The 
circle symbols denote backprojection of reconstructed points, as if the scene had 
been transparent. 
Fig. 3. Bird's-eye perspectives of the London scene, with placement of four
points on the left building according to a city map. Left image: Projective
reconstruction without using subsidiary information. Right image: Affine
reconstruction, corrected for coplanarities given by the walls, and parallelities given
by roofs, walls, windows and ground.



5.2 Parallelity 

Often one knows not only that certain points A, B, C, D are coplanar, but also
that some lines AB and CD connecting them are parallel. If sufficiently many
such parallelities are known, the projective reconstruction can be strengthened
to an affine one. In fact, it is easily seen (cf. (ii) in Example A.1) that $AB \parallel CD$
if and only if

$$(a, -a, b, -b, 0, \ldots, 0)^T \in \beta s(X_\parallel) \quad \text{for some } a, b.$$

This leads to a linear system of equations in $\beta$, from which the depth in the first
image can be determined, yielding an affine reconstruction of the object
configuration $X$. If Euclidean coordinates of four points are known, then Euclidean
reconstruction of the whole configuration is achieved. An algorithm is described
by the left hand loop in Figure 4. The performance on the London images is
illustrated in Figure 3, right diagram.

5.3 Known Foci Locations

By means of Theorem 2 it is also possible to use the affine shape formalism to
make Euclidean hand-eye calibration, even in the case of uncalibrated cameras,
provided that the focal points are known. In fact, knowing the affine shape of the
configuration formed by five or more focal points, from the Chasles matrix the
depth in the first image can be computed, cf. [?], by iteratively solving linear
systems of equations. This is done by a similar argument as the one behind
the algorithm in Section 5.1 and yields an affine reconstruction. When knowing
Euclidean coordinates for the focal points, from the Chasles matrix also the
location of all object points can be computed, yielding Euclidean reconstruction.
An algorithm is described by the right hand loop in Figure 4. Figure 5 illustrates
the typical performance of the algorithm on simulated data, in a situation where
the focal points are densely distributed far away from the object. It is interesting
to note that the impact of image noise mainly consists in a translation of the
object along the ray of sight, while its shape is preserved to a large extent.

Fig. 4. Left loop: algorithm of Sections 5.1 and 5.2. Right loop: algorithm of
Section 5.3.





Fig. 5. Left diagram depicts focal points by squares and object points by circles.
Right diagram depicts true object points by circles and reconstructed points by
crosses, in the presence of image noise.
Appendix



A Affine Shape and Depth

In this section, a brief recapitulation of the definitions and the basic properties
of affine shape and depth spaces is given. For more details and proofs, see [?].



A.l Subspace Formulation 

Let $A^d$ denote an affine space of dimension $d$, where the cases $d = 2$ and $d = 3$ are
of particular interest. In our approach, the primitive objects are not individual
points of $A^d$, but point configurations,

$$X = (X^1, \ldots, X^n), \quad \text{where } X^k \in A^d,\ k = 1, \ldots, n.$$

By the dimension $p_X$ of $X$ is meant the dimension of the smallest affine subspace
containing $X$. Let the set of n-point configurations of dimension $p$ be denoted
$C_{n,p}$. For instance, $C_{n,2}$ consists of n-point configurations in $A^d$ which are planar
but not linear.

The main idea of the approach of affine shape is to peel off any dependency
of the parametrization of $C_{n,p}$ on the coordinatization of $A^d$. To construct such
a parametrization, consider two different coordinate representations $x$ and $\tilde x$ on
$A^d$. Here $\tilde x = Bx + b$, where $B$ is a non-singular $d \times d$-matrix, and $b$ is a column
matrix. For a given configuration $X$, to the respective coordinate representations
we associate matrices with the coordinate vectors as columns, but augmented
with a row of ones,

$$X_a = \begin{bmatrix} x^1 & \cdots & x^n \\ 1 & \cdots & 1 \end{bmatrix}, \qquad
\tilde X_a = \begin{bmatrix} \tilde x^1 & \cdots & \tilde x^n \\ 1 & \cdots & 1 \end{bmatrix}.$$

Then

$$\tilde X_a = A X_a, \quad \text{with} \quad A = \begin{bmatrix} B & b \\ 0 & 1 \end{bmatrix}. \qquad (1)$$

In this way, the set of augmented coordinate matrices is partitioned into
equivalence classes, each of which can be identified with one particular point
configuration. The problem is to label these equivalence classes. This is done by
means of the two consequences of (1):

$$\mathcal{N}(\tilde X_a) = \mathcal{N}(X_a), \qquad \mathcal{R}_{row}(\tilde X_a) = \mathcal{R}_{row}(X_a),$$

where $\mathcal{N}$ stands for ‘nullspace’ (column) and $\mathcal{R}_{row}$ for ‘row space’. We have seen
that these linear subspaces discriminate between point configurations. On the
other hand, one readily verifies that if $\mathcal{N}(\tilde X_a) = \mathcal{N}(X_a)$ or $\mathcal{R}_{row}(\tilde X_a) = \mathcal{R}_{row}(X_a)$,
then there exists an affine transformation $A$ such that $\tilde X_a = A X_a$. This shows
that the linear subspaces $\mathcal{N}(X_a)$ and $\mathcal{R}_{row}(X_a)$ stand in a one-to-one
correspondence with the set of point configurations. Since they are independent of the
coordinatization of $A^d$, the following definition makes sense.
Definition A.1. Let $X_a$ be an augmented coordinate matrix for $X \in C_{n,p}$ with
respect to some coordinate system. Then

the affine shape space of $X$, denoted $s(X)$, is defined by $s(X) = \mathcal{N}(X_a)$,
the affine depth space of $X$, denoted $d(X)$, is defined by $d(X) = \mathcal{R}_{row}(X_a)$.

Often we use abbreviated denominations, saying e.g. ‘affine shape’, ‘shape
space’ or simply ‘shape’ instead of ‘affine shape space’, analogously for depth.
The denomination ‘affine depth’ will be motivated in Remark 1 below. Affine
shape also has an interpretation in terms of barycentric coordinates, cf. e.g. [?].
The discussion above is summarized in the following theorem, saying that
affine shape and affine depth are complete affine invariants.

Theorem A.1. Let $X, \tilde X \in C_{n,p}$. The following statements are equivalent:

- $X$ and $\tilde X$ can be mapped onto each other by an affine transformation,

- $s(X) = s(\tilde X)$,

- $d(X) = d(\tilde X)$.
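
The invariance stated in Theorem A.1 is easy to observe on a small case. The following sketch is an illustration under simple assumptions (four planar points in general position, so that $s(X)$ is one-dimensional; the helper names are not from the paper): the null vector of the augmented coordinate matrix is computed from signed $3 \times 3$ minors, before and after a change of coordinates $\tilde x = Bx + b$. The two vectors differ only by the factor $\det B$, so they span the same shape space.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def shape_vector(points):
    """Generator of s(X) for four planar points: the null vector of the
    3 x 4 augmented coordinate matrix, from signed 3 x 3 minors."""
    xa = [[p[0] for p in points], [p[1] for p in points], [1] * 4]
    return [(-1) ** k * det3([row[:k] + row[k + 1:] for row in xa])
            for k in range(4)]


def affine(points, B, b):
    """Apply x -> Bx + b to every point."""
    return [(B[0][0] * x + B[0][1] * y + b[0],
             B[1][0] * x + B[1][1] * y + b[1]) for x, y in points]


X = [(0, 0), (1, 0), (0, 1), (2, 3)]
B, b = [[2, 1], [1, 3]], (5, -7)          # det B = 5, chosen arbitrarily

xi = shape_vector(X)
xi_tilde = shape_vector(affine(X, B, b))
print(xi, xi_tilde)
```

The components of `xi` sum to zero, and reading them as coefficients gives the affine relation between the four points, here $X^4 = -4X^1 + 2X^2 + 3X^3$.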

To continue, some further notations are needed. Let

$$S_0 = \{\xi = (\xi_1, \ldots, \xi_n) \in \mathbb{R}^n \mid \sum_{k=1}^{n} \xi_k = 0\},$$

and let a multiplication on $\mathbb{R}^n$ be defined by

$$\alpha\xi = (\alpha_1 \xi_1, \ldots, \alpha_n \xi_n) \quad \text{if } \alpha = (\alpha_1, \ldots, \alpha_n),\ \xi = (\xi_1, \ldots, \xi_n).$$

In the same way, division $\xi/\alpha$ is defined by componentwise division, provided
that $\alpha_i \neq 0$, $i = 1, \ldots, n$. We use the notation $\bar\alpha = 1/\alpha$, where $1 = (1, \ldots, 1) \in
\mathbb{R}^n$. Finally, for the situation in Theorem A.1 we use the notation

$$X \equiv \tilde X \iff X \text{ and } \tilde X \text{ have equal shape}.$$

The following example is crucial for some common kinds of subsidiary infor- 
mation. It shows that shape spaces mirror a lot of qualitative information about 
point configurations. 

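Example A.1.

(i) Let $X = (X^1, \ldots, X^n)$. Then the sub-configuration $(X^{k_1}, X^{k_2}, X^{k_3}, X^{k_4})$
is planar if and only if

$$\xi_{k_1} X^{k_1} + \ldots + \xi_{k_4} X^{k_4} = 0 \quad \text{with} \quad \xi_{k_1} + \ldots + \xi_{k_4} = 0.$$

Hence $s(X)$ contains an element $\xi = (\xi_1, \ldots, \xi_n)$ where all components except
$\xi_{k_1}, \ldots, \xi_{k_4}$ vanish.

(ii) One readily verifies that in (i), the vectors $\overrightarrow{X^{k_1}X^{k_2}}$ and $\overrightarrow{X^{k_3}X^{k_4}}$ are
parallel if and only if $\xi_{k_1} = -\xi_{k_2}$, $\xi_{k_3} = -\xi_{k_4}$. In particular, $(X^{k_1}, X^{k_2}, X^{k_3}, X^{k_4})$
forms a parallelogram if and only if $|\xi_{k_1}| = |\xi_{k_2}| = |\xi_{k_3}| = |\xi_{k_4}|$, with two
positive and two negative signs.

(iii) It can be shown that the vector $\overrightarrow{X^{k_1}X^{k_2}}$ is parallel to the plane spanned
by the points $X^{k_3}, X^{k_4}, X^{k_5}$ if and only if $s(X)$ contains an element $\xi = (\xi_1, \ldots,
\xi_n)$, where all components except $\xi_{k_1}, \ldots, \xi_{k_5}$ vanish and $\xi_{k_1} = -\xi_{k_2}$.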
Theorem A.2. Let $X \in C_{n,p}$. Then

— the affine shape space fulfills

• $\dim s(X) = n - p - 1$,

• $s(X) \subset S_0$,

— the affine depth space fulfills

• $\dim d(X) = p + 1$,

• $1 = (1, \ldots, 1) \in d(X)$,

— the affine shape and depth spaces are connected by

• $d(X) \cdot s(X) = 0$,

• $s(X) \oplus d(X) = \mathbb{R}^n$.

The theorem says that the generic dimension of s{X) is n — 3 for 2D- 
configurations, and n — 4 for 3D-configurations. In the same way, the generic 
dimension of d{X) is 3 for 2D-configurations and 4 for 3D-configurations. 

By the last item, the shape and depth spaces of an n-point configuration
are orthogonal complements of each other in $\mathbb{R}^n$, $s(X)^\perp = d(X)$, $d(X)^\perp = s(X)$.
Although this makes one of them seem superfluous, it is practical to use them in
parallel since they embody different aspects of the geometry, with shape space
directed on point configurations and depth spaces on transformations.

The following theorem generalizes Theorem A.1 to singular transformations,
typically from 3D to 2D.

Theorem A.3. Let $X \in C_{n,p}$ and $\tilde X \in C_{n,p-1}$. The following statements are
equivalent:

— $X$ can be mapped onto $\tilde X$ by an affine transformation,

— $s(X) \subset s(\tilde X)$,

— $d(\tilde X) \subset d(X)$.



A.2 Matrix Formulation

To make numerical computations, matrix representations of the linear spaces 
s{X) and d{X) are needed. 

Definition A.2. Let $X \in C_{n,p}$. Then

— by an S-matrix of $X$ is meant a matrix with column space $s(X)$,

— by a D-matrix of $X$ is meant a matrix with row space $d(X)$.

Note that $X_a$ is a D-matrix of $X$. The following example illustrates some
typical computations with S-matrices.



Example A.2. Let

$$S = \begin{bmatrix} 2 & 1 & 4 \\ -2 & 1 & 0 \\ 0 & -2 & -4 \\ 2 & 0 & 2 \\ -1 & -1 & -2 \\ -1 & 1 & 0 \end{bmatrix}.$$

We claim that this is an S-matrix of some configuration $X \in C_{6,2}$. In view
of Theorem A.2, first note that the matrix fulfills the necessary conditions of
having rank 3 and vanishing column sums. The column space is unaffected by
multiplication from the right by a non-singular matrix. In particular, using the
inverse of the submatrix of $S$ formed by the rows 3, 4 and 5, we obtain a new
matrix with the same column space,

$$-\begin{bmatrix} 2 & 1 & 4 \\ -2 & 1 & 0 \\ 0 & -2 & -4 \\ 2 & 0 & 2 \\ -1 & -1 & -2 \\ -1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 0 & -2 & -4 \\ 2 & 0 & 2 \\ -1 & -1 & -2 \end{bmatrix}^{-1} =
\begin{bmatrix} \tfrac{1}{2} & -1 & 0 \\ \tfrac{1}{2} & 1 & 0 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 1 \end{bmatrix}.$$

The corresponding elements of the columns of the matrix on the right hand
side provide the barycentric coordinate representations for the points $X^3$, $X^4$
and $X^5$, respectively, with respect to the affine frame $X^1, X^2, X^6$. Here in fact
more can be said. Thus the first column says that

$$\tfrac{1}{2}X^1 + \tfrac{1}{2}X^2 - X^3 = 0,$$

which means that $X^1, X^2, X^3$ are collinear, and that $X^3$ is the centroid of $X^1$
and $X^2$. The second column says that

$$-X^1 + X^2 - X^4 + X^6 = 0,$$

which means that $X^1, X^2, X^4, X^6$ are vertices of a parallelogram. Finally, the
third column says that

$$-X^5 + X^6 = 0,$$

i.e. that the points $X^5$ and $X^6$ coincide.
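
As a cross-check of Example A.2, one can build a concrete planar configuration from the three relations read off above and confirm that its augmented coordinate matrix annihilates every column of $S$. This is a sketch only; the coordinates chosen for the frame points $X^1, X^2, X^6$ are arbitrary assumptions, and `combine` is an illustrative helper.

```python
from fractions import Fraction

S = [[2, 1, 4], [-2, 1, 0], [0, -2, -4], [2, 0, 2], [-1, -1, -2], [-1, 1, 0]]


def combine(weights, points):
    """Affine combination sum_k w_k * P_k of 2D points."""
    return tuple(sum(w * p[i] for w, p in zip(weights, points))
                 for i in range(2))


# Frame points chosen freely (any non-collinear triple would do).
X1, X2, X6 = (0, 0), (2, 0), (0, 1)
half = Fraction(1, 2)
X3 = combine([half, half], [X1, X2])      # midpoint of X1 and X2
X4 = combine([-1, 1, 1], [X1, X2, X6])    # fourth parallelogram vertex
X5 = X6                                   # coinciding points

points = [X1, X2, X3, X4, X5, X6]
Xa = [[p[0] for p in points], [p[1] for p in points], [1] * 6]

# Xa * S should be the 3 x 3 zero matrix if S is an S-matrix of this X.
product = [[sum(Xa[r][k] * S[k][c] for k in range(6)) for c in range(3)]
           for r in range(3)]
print(product)
```

The last row of the product being zero is just the statement that the column sums of $S$ vanish.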

So far, we have avoided ‘points at infinity’. The following example illustrates 
how they can be treated within the framework of affine shape. 

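Example A.3. Let $X^1, X^2, X^3$ be three fixed points, and let $X^4$ be defined by

$$\overrightarrow{X^1X^4} = \omega\omega_2 \overrightarrow{X^1X^2} + \omega\omega_3 \overrightarrow{X^1X^3}. \qquad (2)$$

Then $s(X)$ is a one-dimensional subspace of $\mathbb{R}^4$ generated by the vector $(1 -
\omega\omega_2 - \omega\omega_3, \omega\omega_2, \omega\omega_3, -1)$. Letting $\omega \to \infty$, by (2) we are led to interpret $X^4$ as
the point at infinity in the direction $\omega_2 \overrightarrow{X^1X^2} + \omega_3 \overrightarrow{X^1X^3}$. Taking limits also of
$s(X)$, where $X = (X^1, \ldots, X^4)$, one finds that

$$s(X) = \{(\xi_1, \xi_2, \xi_3, 0) \text{ with } \xi_1 + \xi_2 + \xi_3 = 0\}.$$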
A.3 Relation to Grassmann Manifolds

During the last few years, Grassmann-Cayley and exterior algebra has attracted
some attention in computer vision, cf. e.g. [?]. The discussion below aims at
explaining the place of affine shape and depth in this context.

By Theorem A.2 every point configuration $X$ obeys the inclusion $s(X) \subset
S_0$. Conversely, every linear subspace $U$ of $S_0$ is shape space for some point
configuration. In fact, if $\dim U = a$, in the same way as in Example A.2 it is
seen that $U = s(X)$ for some configuration $X$ of dimension $p_X = n - a - 1$. By
Theorem A.1 we thus have a one-to-one correspondence between linear subspaces
of $S_0$ and point configurations modulo affine transformations,

$$C_{n,p}/\mathrm{aff} \simeq G(S_0, n - p - 1),$$

where $G(S_0, d)$ denotes the Grassmann manifold consisting of all $d$-dimensional
linear subspaces of $S_0 \subset \mathbb{R}^n$.

The connection to Grassmann algebra can be made more precise. For instance,
let $X$ be a planar 4-point configuration, with an augmented coordinate matrix
$X_a$. By definition, the S-matrix, which in this case only has one column, is
obtained by solving for the nullspace of $X_a$. Cramer's rule yields the components

$$\xi_{ijk} = \det \begin{bmatrix} x_i & x_j & x_k \\ 1 & 1 & 1 \end{bmatrix}. \qquad (3)$$

These are recognized as Plücker coordinates of the subspace $s(X)$ in $\mathbb{R}^4$. The
same holds true for bigger configurations. For instance, if $X$ is a 5-point
configuration, then all $\times$:es in the S-matrix

$$S_X = \begin{bmatrix} \times & \times \\ \times & \times \\ \times & \times \\ \times & 0 \\ 0 & \times \end{bmatrix}$$

are Plücker coordinates for $s(X)$.

From this we learn that in uncalibrated camera geometry, it is only parameters
of the form (3) that matter. In fact, regardless of the choice of coordinate
system, image data organize themselves into such packages. An important feature
of the approach by affine shape is that the array structure of the S-matrix
adds further geometric information, compared to the Plücker coordinates alone.



B Projective Transformations and Spaces 

B.l Background 

By a perspective transformation $P : A^3 \to A^2$ with focus $\phi$ and image plane
$\Pi$ is meant a transformation such that every point $X \in A^3$ is mapped to the
point of intersection of $\Pi$ and the line $\phi X$. Perspective transformations are used
to model the pinhole camera. Whenever the focus is of interest, the notation $P_\phi$
will be used.

To deal with perspective transformations and their compositions, the projective
transformations, to get a coherent theory one has to adjoin points at infinity
to the ambient affine space. As described in Example A.3 such points can be
interpreted as directions in $A^d$. This gives a model for the $d$-dimensional
projective space $P^d$. If $\phi$ is a point at infinity, then $P_\phi$ is a parallel projection in the
direction described by $\phi$.

If $Y = P_\phi(X)$, then $\overrightarrow{\phi X} = \alpha \overrightarrow{\phi Y}$ for some $\alpha$, called the depth of $X$ with respect
to $Y$. If $\phi$ is a point at infinity, the depth is by definition 1. If a configuration $X$ is
mapped onto a configuration $Y$ by a perspective transformation, and the depth
of $X^k$ with respect to $Y^k$ is $\alpha_k$, $k = 1, \ldots, n$, then the vector $\alpha = (\alpha_1, \ldots, \alpha_n) \in
\mathbb{R}^n$ is called the depth vector of $X$ with respect to $Y$.

B.2 Shape, Depth, and Projective Transformations 

The following theorem gives a complete description of the single view geometry. 
Theorem B.1. Shape and depth theorem.

(i) If $X, Y \in C_{n,p}$, then the following statements are equivalent:

• there exists a perspective transformation $P$, such that $P(X) = Y$ with
depth vector $\alpha$,

• $\alpha s(X) = s(Y)$,

• $\alpha d(Y) = d(X)$.

(ii) If $X \in C_{n,p}$, $Y \in C_{n,p-1}$, then the following statements are equivalent:

• there exists a perspective transformation $P$, such that $P(X) = Y$ with
depth vector $\alpha$,

• $\alpha s(X) \subset s(Y)$,

• $\alpha d(Y) \subset d(X)$.

Remark 1. Now the terminology ‘affine depth space’ can be motivated, giving
the answer to the question: Which depths can occur in conjunction with $X$?
From Theorem B.1 and the fact that every subspace of $S_0$ is a shape space, it
follows that $\alpha$ is the depth of a perspective mapping acting on $X$ if and only if
$\alpha s(X) \subset S_0$, i.e. $\alpha \in s(X)^\perp$. Since, by Theorem A.2, $s(X)^\perp = d(X)$, the name
‘depth space’ for $d(X)$ is adequate.

B.3 Location of Focal Point 

By Theorem B.1, there exists a projective transformation from 3D to 2D,
$P_\phi : X \to Y$, with depth $\alpha$, if and only if there holds a strict inclusion between
linear subspaces, $\alpha s(X) \subset s(Y)$. According to the following theorem, there is a
one-to-one correspondence between $\phi$ and the set-difference $s(Y) \setminus \alpha s(X)$, and
it is possible to express $\phi$ in terms of $X$ by an explicit formula. In formulating
the theorem, a degenerate case called ‘flat projection’ has to be singled out,
for details see [?]. If $X$ is an n-point configuration and $\phi$ a point, then $(X, \phi)$
denotes the $(n+1)$-point configuration formed by adjoining $\phi$ as an $(n+1)$:th point
after the points of $X$.



Theorem B.2. Focal point theorem. Suppose that $\alpha s(X) \subset s(Y)$, where $\alpha \in
\mathbb{R}^n$, $X \in C_{n,p}$, and $Y \in C_{n,p-1}$. Then $Y = P_\phi(X)$ with a non-flat projection
if and only if

$$\phi = \sum_{k=1}^{n} \bar\alpha_k \eta_k X_k \Big/ \sum_{k=1}^{n} \bar\alpha_k \eta_k, \qquad (4)$$

with $\eta \in s(Y) \setminus \alpha s(X)$. Analogously when $\phi$ is a point at infinity. In either case,
the compound configuration $(X, \phi)$ has an S-matrix

$$S_{(X,\phi)} = \begin{bmatrix} \operatorname{diag}(\bar\alpha) S_Y \\ -\bar\alpha^T S_Y \end{bmatrix}. \qquad (5)$$
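
Formula (4) can be tried out on a toy camera. The sketch below rests on assumptions made for illustration only: the image plane is taken to be $z = 0$, the depth is taken as the factor $\alpha$ in $\overrightarrow{\phi X} = \alpha \overrightarrow{\phi Y}$, four object points are used so that $s(Y)$ is one-dimensional, and all function names are hypothetical. The focus is then recovered from the object points, the depths and a shape vector $\eta$ of the image alone.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def project(focus, point):
    """Central projection from `focus` onto the plane z = 0.
    Returns the image point and the depth a, where
    point - focus = a * (image - focus)."""
    fz, pz = focus[2], point[2]
    depth = Fraction(fz - pz, fz)
    image = tuple(f + (p - f) / depth for f, p in zip(focus, point))
    return image, depth


def shape_vector(image_points):
    """Generator of s(Y) for four image points, from signed 3 x 3 minors
    of the augmented matrix of their in-plane coordinates."""
    ya = [[y[0] for y in image_points], [y[1] for y in image_points], [1] * 4]
    return [(-1) ** k * det3([row[:k] + row[k + 1:] for row in ya])
            for k in range(4)]


def focal_point(points, depths, eta):
    """Formula (4): weighted mean of the object points, weights eta_k / a_k."""
    w = [e / a for e, a in zip(eta, depths)]
    total = sum(w)
    return tuple(sum(wk * p[i] for wk, p in zip(w, points)) / total
                 for i in range(3))


focus = (1, -1, 2)
X = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, -2)]

images, depths = zip(*(project(focus, p) for p in X))
eta = shape_vector(images)
recovered = focal_point(X, depths, eta)
print(recovered)
```

The weights $\bar\alpha_k \eta_k$ together with minus their sum form a column of the S-matrix (5) of the compound configuration $(X, \phi)$, which is why the weighted mean lands on the focus.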



B.4 Depth, Shape, and Camera Matrices 



With $x$, $y$ denoting object and image coordinates, respectively, the camera matrix
$P$ fulfills

$$P \begin{bmatrix} x \\ 1 \end{bmatrix} = \lambda \begin{bmatrix} y \\ 1 \end{bmatrix}.$$

For point configurations, it follows that

$$P X_a = Y_a \Lambda, \quad \text{with } \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n).$$

The camera matrix of course depends on the coordinate systems used for the
scene and the images.

From the equation $P X_a = Y_a \Lambda$ one reads out that each row of $Y_a \Lambda$ is a
linear combination of the rows of $X_a$, with coefficients from $P$. It follows that
$\lambda d(Y) \subset d(X)$, where $\lambda = (\lambda_1, \ldots, \lambda_n)$. Conversely, if $\lambda d(Y) \subset d(X)$ it can
be shown that there exists a matrix $P$ such that $P X_a = Y_a \Lambda$. This shows
the connection between camera matrices and depth and shape spaces, and that
$\lambda = \alpha$ is the depth vector of the camera transformation.

Working with camera matrices, it is well known that the focus is obtained as
the nullspace of the camera matrix. Theorem B.2 gives a novel characterization,
having the advantage of providing an explicit formula for the focal point in terms
of the object $X$. To see the connection, take $\eta \in s(Y)$. From $P X_a \Lambda^{-1} = Y_a$, it
follows that $P X_a \Lambda^{-1} \eta = Y_a \eta = 0$, which shows that $X_a \Lambda^{-1} \eta$ belongs to the
nullspace of $P$, and thus is a focal point.



From Ordinal to Euclidean Reconstruction with 
Partial Scene Calibration 



Daphna Weinshall¹, P. Anandan², and Michal Irani³*

¹ Institute of Computer Science, Hebrew University, 91904 Jerusalem, Israel
daphna@cs.huji.ac.il

² Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
anandan@microsoft.com

³ Dept. of Applied Math and CS, The Weizmann Inst. of Science, Rehovot, Israel
irani@wisdom.weizmann.ac.il



Abstract. Since uncalibrated images permit only projective reconstruc- 
tion, metric information requires either camera or scene calibration. We 
propose a stratified approach to projective reconstruction, in which grad- 
ual increase in domain information for scene calibration leads to gradual 
increase in 3D information. Our scheme includes the following steps: (1) 
Register the images with respect to a reference plane; this can be done 
using limited scene information, e.g., the knowledge that two pairs of 
lines on the plane are parallel. We show that this calibration is sufficient 
for ordinal reconstruction - sorting the points by their height over the 
reference plane. (2) If available, use the relative height of two additional 
out-of-plane points to compute the height of the remaining points up to 
constant scaling. Our scheme is based on the dual epipolar geometry in 
the reference frame, which we develop below. We show good results with 
five sequences of real images, using mostly scene calibration that can be 
inferred directly from the images themselves. 



1 Introduction 

Given multiple images, stratified 3D reconstruction can be obtained depending
on the available camera and scene calibration. In general, uncalibrated images
permit only projective reconstruction [?], which is of limited use; for example,
we cannot determine from projective structure which part of the object is in
front of the other. Its topological nature makes this representation useful
primarily for verification, e.g., object recognition; for most other
applications some scene or camera calibration is needed.

With calibrated cameras, or if self calibration is possible (when the same
camera with partially fixed internal parameters is used to obtain all the
images), Euclidean reconstruction can be obtained [?]. Alternatively, active
vision techniques, based on imposing constraints on the viewing geometry and/or
the camera motion, can be used. Such externally imposed constraints may
simplify the problem enough to permit affine or Euclidean reconstruction, e.g.,
[?] (see also [?]). However, active vision techniques cannot be universally
applied, and in particular they are of little help when the sequence of images
is already given (such as the case with video analysis). Similarly, techniques
which obtain Euclidean reconstruction using self camera calibration cannot be
used with just any sequence unless it is known that some of the internal
parameters of the camera taking the sequence were kept fixed or are known.

* MI and DW are supported in part by DARPA through ARL Contract DAAL01-97-
K-0101. This research was done while DW was on sabbatical at NECI Princeton.

The alternative to active vision and self camera calibration is to use scene
calibration. Thus projective reconstruction can be turned into Euclidean
reconstruction if the 3D coordinates of five points are given [?]. Computing
Euclidean reconstruction from affine reconstruction requires that the 3D
coordinates of four points are given. These results were used to formulate a
stratified approach to reconstruction, characterized by invariance to
increasingly smaller groups of transformations in $\mathcal{R}^3$: projective,
affine and similarity (scaled Euclidean) [?]. The usefulness of techniques
using scene calibration is limited to cases when the needed 3D information is
available. Unlike the case with camera calibration and active vision, there
are no results on partial scene calibration which could give partial metric
information.

Our approach fills in this gap (a related approach is independently described
in [?]); we investigate partial scene calibration which can be used to obtain
some metric information from uncalibrated images. Unlike previous approaches
using scene calibration, in which the 3D information had to be given a priori,
we make use of scene information that can be inferred automatically from
images, such as that lines are parallel. This information only permits ordinal
reconstruction in our scheme. In addition, in order to achieve affine
reconstruction we need to know the relative height of one 3D point; this
requirement still seems easier to meet than the knowledge of the 3D coordinates
of five 3D points, as required in [?].

More specifically, we compute non-invariant reconstruction in a special co- 
ordinate system. This coordinate system is defined relative to a physical (real 
or virtual) planar surface in the scene. The projection of the scene points onto 
an input camera image is decomposed into two stages: (i) the projection of each 
scene point through the focal-point of the camera onto the reference plane, and 
(ii) the re-projection of the reference plane image onto the camera image plane. 
The projection from 3D to the reference plane depends only on the 3D positions 
of the scene points and the focal-point of the camera. All effects of the camera’s 
internal calibration and the orientation of the image plane are folded into the 
re-projection step, which is captured by a homography between the reference 
plane and the camera image plane. 
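The two-stage decomposition above can be illustrated with a minimal sketch. It assumes the reference plane is Z = 0 in the reference plane coordinate system and uses inhomogeneous coordinates; the function names, the focal point, the scene point and the homography `H_t` are hypothetical values chosen only for illustration, not data from the experiments.

```python
def project_to_reference_plane(point, focal):
    """Stage (i): intersect the ray from `focal` through `point` with the plane Z = 0."""
    X, Y, Z = point
    Xt, Yt, Zt = focal
    if Z == Zt:
        raise ValueError("ray is parallel to the reference plane")
    s = Zt / (Zt - Z)  # ray parameter at which the Z coordinate vanishes
    return (Xt + s * (X - Xt), Yt + s * (Y - Yt))


def apply_homography(H, xy):
    """Stage (ii): map a 2D point through a 3x3 homography given as nested lists."""
    x, y = xy
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


# Hypothetical numbers, for illustration only.
focal = (0.0, 0.0, 5.0)
scene_point = (1.0, 2.0, 1.0)
H_t = [[2.0, 0.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]

on_plane = project_to_reference_plane(scene_point, focal)  # reference-plane image
in_image = apply_homography(H_t, on_plane)                 # camera image
```

Only the first function depends on the 3D positions of the scene point and the focal point; everything specific to the camera's internal calibration and image-plane orientation sits in `H_t`.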

The reconstruction of the 3D scene is done relative to the reference plane,
within a Cartesian coordinate system whose X-Y axes span the reference
plane, and whose Z direction is perpendicular to it. Given multiple images of
the same scene, the homography relating each image to the reference plane is
determined by specifying a few pieces of geometric information about the
reference plane. This homography is then applied to each camera image to
determine the corresponding reference-plane image. The “Height” (Z) of each
scene point is determined by analyzing the disparity between the positions of
each scene point on the multiple reference plane images.^1 The basic
relationships associated with multi-view parallax geometry (with respect to a
reference plane) are described in our recent paper [?]. However, while [?]
focuses on the geometric relationships and elaborates on one application
(namely “new view synthesis”), the present paper focuses on the use of this
framework for stratified reconstruction, based on partial scene calibration.

The specification of the geometric information about the reference plane
amounts to “registering” the reference plane. Such registration, with no 3D
calibration, is sufficient to determine whether points lie on the same side of
the plane [?]. Here we show that ordinal information about the 3D scene can
be obtained even with very little scene calibration data, e.g., we can compute
height ordering with respect to a plane from knowing (or guessing) the existence
of two pairs of parallel lines on the plane; we call this ordinal reconstruction
(cf. [?]). By providing additional domain information (e.g., heights of one or
two out-of-plane points), we can gradually obtain more metric 3D information,
achieving affine and Euclidean reconstruction.

2 Geometry on the Reference Plane 

The perspective projection of a point $P_i$ in space to an image plane can be
written as $p_{it} = M_t P_i$, where $p_{it}$ and $P_i$ are the homogeneous
coordinates of the point in 2D and 3D respectively, and $M_t$ is the
$3 \times 4$ projection matrix describing the camera $t$. $M_t$ depends on the
orientation of the image plane and the location of the camera center $P_t$.

Here we break down the projection into 2 operations: the projection of the
3D world onto a 2D reference plane $\Pi$ through the focal-point $P_t$,
followed by a 2D projective transformation (homography) which maps the
reference plane $\Pi$ to the image plane of camera $t$.

Our purpose in this paper is to use this decomposition for the gradual
reconstruction of the 3D scene relative to $\Pi$. The key idea in this paper is
that by analyzing the images formed by projecting the scene through each camera
center onto the reference plane $\Pi$, we vastly simplify the problem of 3D
reconstruction. As explained in Section 3, the reference plane images from each
camera view are obtained by registering each image to a pre-defined (affine or
Cartesian) coordinate system on the reference plane.

2.1 Reference Plane Coordinate System 

Let $\Pi$ denote a (real or virtual) planar surface in the scene. We call $\Pi$
the “reference plane”. We define a 3D Cartesian coordinate system whose X-Y
axes span the reference plane $\Pi$, and whose Z direction is perpendicular to
$\Pi$. We call this system the “reference plane coordinate system”.

^1 A related approach called “plane+parallax” was taken in [?], but there 3D
reconstruction was relative to the coordinate systems of both the reference
plane and a reference image.

An object, composed of $n$ points in 3D space, is represented in the reference
plane coordinate system by the shape matrix $P$, the following $4 \times n$
matrix:

$$
P = \begin{bmatrix}
0 & a_1 & a_2 & a_3 & a_4 & \cdots & X_i & X_{i+1} & \cdots \\
0 & b_1 & b_2 & b_3 & b_4 & \cdots & Y_i & Y_{i+1} & \cdots \\
1 & 0 & 0 & 0 & 0 & \cdots & Z_i & Z_{i+1} & \cdots \\
0 & 1 & 1 & 1 & 1 & \cdots & W_i & W_{i+1} & \cdots
\end{bmatrix}
\qquad (1)
$$

Columns $[a_i\ b_i\ 0\ 1]^T$ are the coordinates of points on the reference
plane $\Pi$, $[0\ 0\ 1\ 0]^T$ is a point at $\infty$ on the line through the
origin which is perpendicular to the reference plane, and columns
$[X_i\ Y_i\ Z_i\ W_i]^T$ are the coordinates of general points $P_i$ on the
object.

Every object can be represented this way; but the representation is not unique 
- a transformation from another projective coordinate system to this particular 
one is determined only up to an independent scaling of the Z axis. This ambiguity 
comes about because the standard projective basis requires that no four points 
are co-planar, whereas the basis of our system includes a plane. We need not 
worry about this ambiguity in the following derivations, however, because all 
quantities of interest involve ratios of Z values. 



2.2 Projection on the Reference Plane 

The 3D scene points are first projected onto the reference plane $\Pi$ through
the focal-point $P_t$ of the camera. This forms a “virtual” image on $\Pi$. We
refer to this image as the “reference-plane” image.

Corresponding to the object-shape matrix $P$ defined in (1) and camera $t$, we
define the following matrix $p_t$ of reference-image points:

$$
p_t = \begin{bmatrix}
\alpha_t & a_1 & a_2 & a_3 & a_4 & \cdots & x_{it} & x_{(i+1)t} & \cdots \\
\beta_t & b_1 & b_2 & b_3 & b_4 & \cdots & y_{it} & y_{(i+1)t} & \cdots \\
\gamma_t & 1 & 1 & 1 & 1 & \cdots & w_{it} & w_{(i+1)t} & \cdots
\end{bmatrix}
$$

Note that the first two coordinates of the points 1, ..., 4, which are
physically located on the reference plane $\Pi$, are the same as the
corresponding coordinates in their 3D representation. This is a direct
consequence of our choice of the coordinate system for the 3D representation.

As explained before, the coordinates of the image points in the “original”
input images are obtained by applying a 2D projective transformation to the
corresponding reference plane images. That is, $q_t = H_t p_t$, where $q_t$ is
the “image matrix” corresponding to the original image $t$, and $H_t$ is the 2D
projective transformation ($3 \times 3$ matrix) that relates the reference
plane to the image plane of camera $t$.

As will become clear later, our stratified approach to 3D reconstruction does
not require that the reference plane image coordinates be completely known.
It is sufficient to specify them up to a 2D affine deformation of the reference
plane $\Pi$. Thus, it is sufficient to specify $\hat{p}_t = G_t p_t$, where

$$
G_t = \begin{bmatrix}
g_{11} & g_{12} & g_{13} \\
g_{21} & g_{22} & g_{23} \\
0 & 0 & 1
\end{bmatrix}
$$

denotes the (possibly unknown) affine transformation of the reference plane. In
the remainder of this paper, we call $p_t$ the “fully normalized” reference
plane image, and $\hat{p}_t$ the “affine normalized” reference plane image.

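If image $p_t$ is fully normalized, it can be shown (or, more easily, verified)
that the projection matrix $M_t$ associated with it is:

$$
M_t = \begin{bmatrix}
\delta_t & 0 & \alpha_t & 0 \\
0 & \delta_t & \beta_t & 0 \\
0 & 0 & \gamma_t & \delta_t
\end{bmatrix}
$$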



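We impose the constraint that the focal point of the camera cannot possibly be
seen in the (reference-plane) image; thus it is the null vector of $M_t$. We
denote the 3D coordinates of the focal point $P_t$ of the camera $t$ by
$[X_t\ Y_t\ Z_t\ W_t]^T$; then

$$
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\propto
\begin{bmatrix}
\delta_t & 0 & \alpha_t & 0 \\
0 & \delta_t & \beta_t & 0 \\
0 & 0 & \gamma_t & \delta_t
\end{bmatrix}
\begin{bmatrix} X_t \\ Y_t \\ Z_t \\ W_t \end{bmatrix}
\;\Longrightarrow\;
\begin{cases}
0 = \delta_t X_t + \alpha_t Z_t \\
0 = \delta_t Y_t + \beta_t Z_t \\
0 = \delta_t W_t + \gamma_t Z_t
\end{cases}
$$

and the solution is:

$$
[\delta_t\ \ \alpha_t\ \ \beta_t\ \ \gamma_t] \propto [-Z_t\ \ X_t\ \ Y_t\ \ W_t]
$$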

The normalized reference-plane image $p_t$ is therefore obtained by the
following projection:

$$
p_t \propto \begin{bmatrix}
-Z_t & 0 & X_t & 0 \\
0 & -Z_t & Y_t & 0 \\
0 & 0 & W_t & -Z_t
\end{bmatrix} P
$$



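Let $p_{it} = [x_{it}\ y_{it}\ w_{it}]^T$ denote the reference-plane image
position of the object point $P_i$ projected through the camera focal-point
$P_t$ (onto $\Pi$). Then:

$$
p_{it} = \begin{bmatrix} x_{it} \\ y_{it} \\ w_{it} \end{bmatrix}
\propto
\begin{bmatrix}
-Z_t & 0 & X_t & 0 \\
0 & -Z_t & Y_t & 0 \\
0 & 0 & W_t & -Z_t
\end{bmatrix}
\begin{bmatrix} X_i \\ Y_i \\ Z_i \\ W_i \end{bmatrix}
\qquad (2)
$$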

Observe that each point on the reference-plane image imposes 2 constraints
relating to the coordinates of the object point $P_i$ and the camera center
$P_t$, which follow immediately from (2):

$$
\frac{x_{it}}{w_{it}} = \frac{-Z_t X_i + X_t Z_i}{-Z_t W_i + W_t Z_i},
\qquad
\frac{y_{it}}{w_{it}} = \frac{-Z_t Y_i + Y_t Z_i}{-Z_t W_i + W_t Z_i}
\qquad (3)
$$

In this equation, the 3D coordinates of the 3D point $P_i$ and the focal point
$P_t$ of camera $t$ are dual: the focal point of the camera
$[X_t\ Y_t\ Z_t\ W_t]$ is interchangeable with the 3D point
$[X_i\ Y_i\ Z_i\ W_i]$.
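The projection and the duality just described can be checked numerically. The following is a minimal sketch: `reference_image` simply evaluates the matrix product of the projection onto the reference plane, and the point and camera coordinates are hypothetical values chosen for illustration.

```python
def reference_image(P, C):
    """Homogeneous reference-plane image of point P = [X, Y, Z, W]
    projected through the focal point C = [Xt, Yt, Zt, Wt]."""
    X, Y, Z, W = P
    Xt, Yt, Zt, Wt = C
    return (-Zt * X + Xt * Z, -Zt * Y + Yt * Z, -Zt * W + Wt * Z)


def dehomogenize(p):
    """Convert [x, y, w] to the inhomogeneous position (x/w, y/w)."""
    return (p[0] / p[2], p[1] / p[2])


# Hypothetical coordinates, for illustration only.
point = (1.0, 2.0, 1.0, 1.0)
camera = (0.0, 0.0, 5.0, 1.0)

p_direct = reference_image(point, camera)   # point seen from the camera
p_swapped = reference_image(camera, point)  # roles of point and camera exchanged
```

Exchanging the two arguments negates all three homogeneous coordinates, so both calls describe the same position on the reference plane; this is the interchangeability of point and focal point noted above.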

From (3) we get the epipolar and dual-epipolar geometry on the reference
plane.

2.3 Epipolar Geometry on the Reference Plane

The epipolar geometry is obtained by the elimination of the 3D point
coordinates from (3). In the following derivation we follow the method
described in [?]. First, we rewrite (3) and note that each camera $t$ imposes 2
constraints on the 3D point coordinates:

$$
\begin{pmatrix}
w_{it} Z_t & 0 & (x_{it} W_t - w_{it} X_t) & -x_{it} Z_t \\
0 & w_{it} Z_t & (y_{it} W_t - w_{it} Y_t) & -y_{it} Z_t
\end{pmatrix}
\begin{bmatrix} X_i \\ Y_i \\ Z_i \\ W_i \end{bmatrix} = 0
$$

Two cameras (e.g., $t$ and $s$) give 4 constraints. Since they have a
non-trivial solution $[X_i\ Y_i\ Z_i\ W_i]$, the determinant of their matrix
must equal 0. This yields a relation between the focal points
$P_t = [X_t\ Y_t\ Z_t\ W_t]$ and $P_s = [X_s\ Y_s\ Z_s\ W_s]$, and the
positions $p_{it} = [x_{it}\ y_{it}\ w_{it}]$ and
$p_{is} = [x_{is}\ y_{is}\ w_{is}]$ of the point $P_i$ in the two corresponding
reference-plane images. This relation can be rewritten as:

$$
p_{it}^T F_{ts}\, p_{is} = 0, \qquad
F_{ts} = \begin{bmatrix}
0 & Z_s W_t - Z_t W_s & -Z_s Y_t + Z_t Y_s \\
-Z_s W_t + Z_t W_s & 0 & Z_s X_t - Z_t X_s \\
Z_s Y_t - Z_t Y_s & -Z_s X_t + Z_t X_s & 0
\end{bmatrix}
\qquad (4)
$$

and $F_{ts}$ is the “fundamental” matrix.

(4) gives the epipolar geometry on the reference plane: a relation between the
coordinates of point $P_i$ in the reference-plane image from camera $t$ and its
coordinates in the reference-plane image from camera $s$. The “fundamental
matrix” $F_{ts}$ is anti-symmetric, with essentially 3 unknowns:
$(Z_s X_t - Z_t X_s)$, $(Z_s Y_t - Z_t Y_s)$ and $(Z_s W_t - Z_t W_s)$. We also
observe that these are the coordinates of the point $p_{ts}$, which can be
obtained by substituting the object point $P_i$ by the camera-center $P_s$ in
(2). Thus, $p_{ts}$, which determines the fundamental matrix $F_{ts}$, is the
projection of the focal point $P_s$ through $P_t$ on the reference plane; in
other words, $p_{ts}$ is the epipole.

The two epipoles, $p_{st}$ and $p_{ts}$, are the same on the reference plane;
they are defined by the intersection of the line going through the focal points
of the two cameras and the reference plane. Thus the epipolar geometry on both
image $s$ and image $t$, when mapped to the reference plane, is the same. This
remains true even if the images are not normalized at all, that is, they are
aligned with the reference plane up to some homography only.
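As a numerical check of the epipolar relation on the reference plane, the sketch below builds the anti-symmetric matrix from two focal points and verifies the bilinear constraint for one scene point. The coordinates are hypothetical integers chosen so the arithmetic is exact; the helper names are ours.

```python
def reference_image(P, C):
    """Homogeneous reference-plane image of P projected through focal point C."""
    X, Y, Z, W = P
    Xt, Yt, Zt, Wt = C
    return (-Zt * X + Xt * Z, -Zt * Y + Yt * Z, -Zt * W + Wt * Z)


def fundamental_matrix(Ct, Cs):
    """Return (F_ts, epipole) for focal points Ct and Cs on the reference plane."""
    Xt, Yt, Zt, Wt = Ct
    Xs, Ys, Zs, Ws = Cs
    e = (Zs * Xt - Zt * Xs, Zs * Yt - Zt * Ys, Zs * Wt - Zt * Ws)
    F = [[0, e[2], -e[1]],
         [-e[2], 0, e[0]],
         [e[1], -e[0], 0]]
    return F, e


def bilinear(p, F, q):
    """Evaluate p^T F q for 3-vectors p, q and a 3x3 matrix F."""
    return sum(p[r] * F[r][c] * q[c] for r in range(3) for c in range(3))


# Hypothetical coordinates, for illustration only.
C_t = (0, 0, 5, 1)
C_s = (2, 1, 4, 1)
P_i = (1, 2, 1, 1)

F_ts, epipole = fundamental_matrix(C_t, C_s)
p_it = reference_image(P_i, C_t)
p_is = reference_image(P_i, C_s)
residual = bilinear(p_it, F_ts, p_is)  # vanishes for a true correspondence
```

The vector `epipole` coincides with the image of `C_s` projected through `C_t`, and the constraint states that `p_it`, `p_is` and the epipole are collinear on the reference plane.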



2.4 Dual Geometry on Reference Plane

The dual epipolar geometry is obtained by eliminating the coordinates of the
camera’s focal point from (3). Once again, we first rewrite (3) so that each
point $P_i$ imposes 2 constraints on the coordinates of the focal point of the
camera $t$:

$$
\begin{pmatrix}
w_{it} Z_i & 0 & (x_{it} W_i - w_{it} X_i) & -x_{it} Z_i \\
0 & w_{it} Z_i & (y_{it} W_i - w_{it} Y_i) & -y_{it} Z_i
\end{pmatrix}
\begin{bmatrix} X_t \\ Y_t \\ Z_t \\ W_t \end{bmatrix} = 0
$$

Two points give 4 constraints, which have a non-trivial solution. Following
the same reasoning as before for points $P_i, P_j$, we get a relation between
the 3D coordinates of the two points $[X_i\ Y_i\ Z_i\ W_i]$,
$[X_j\ Y_j\ Z_j\ W_j]$, and the image coordinates $p_{it}, p_{jt}$ of the
points in image $t$. This relation can be rewritten as

$$
p_{it}^T G_{ij}\, p_{jt} = 0, \qquad
G_{ij} = \begin{bmatrix}
0 & Z_j W_i - Z_i W_j & -Z_j Y_i + Z_i Y_j \\
-Z_j W_i + Z_i W_j & 0 & Z_j X_i - Z_i X_j \\
Z_j Y_i - Z_i Y_j & -Z_j X_i + Z_i X_j & 0
\end{bmatrix}
\qquad (5)
$$

and $G_{ij}$ is the dual “fundamental” matrix.

Similar to the case of the fundamental matrix $F$, the dual fundamental matrix
$G$ is determined by the coordinates of the dual epipole $p_{ij}$ [?], which is
obtained by projecting the scene point $P_i$ through $P_j$ onto the reference
plane. These dual relations are similar to those previously described in [?],
where they were derived with respect to the quantities from the actual image
points; here the dual relations are derived in the context of the
reference-plane images.
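The dual constraint says that, in every view, the reference-plane images of two points and their dual epipole are collinear. A minimal two-view sketch therefore recovers the dual epipole (up to scale) as the intersection of the two lines joining the image pair in each view. This is an illustrative shortcut for exact data, not the least-squares procedure used in the experiments; all coordinates are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors (line through two points, or point on two lines)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def reference_image(P, C):
    """Homogeneous reference-plane image of P projected through focal point C."""
    X, Y, Z, W = P
    Xt, Yt, Zt, Wt = C
    return (-Zt * X + Xt * Z, -Zt * Y + Yt * Z, -Zt * W + Wt * Z)


def dual_epipole_from_views(pi1, pj1, pi2, pj2):
    """Intersect the line (p_i, p_j) seen in view 1 with the same line in view 2."""
    return cross(cross(pi1, pj1), cross(pi2, pj2))


# Hypothetical scene points and focal points, for illustration only.
P_i, P_j = (1, 2, 1, 1), (3, -1, 2, 1)
C_1, C_2 = (0, 0, 5, 1), (2, 1, 4, 1)

estimated = dual_epipole_from_views(
    reference_image(P_i, C_1), reference_image(P_j, C_1),
    reference_image(P_i, C_2), reference_image(P_j, C_2))

# Closed form read off from the entries of the dual fundamental matrix.
Xi, Yi, Zi, Wi = P_i
Xj, Yj, Zj, Wj = P_j
closed_form = (Zj * Xi - Zi * Xj, Zj * Yi - Zi * Yj, Zj * Wi - Zi * Wj)
```

The two vectors agree up to a scale factor, which is all that a homogeneous point determines.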

2.5 Using Affine Normalization

When the homography $H_t$ between the actual observed image $q_t$ and the
corresponding virtual image $p_t$ on the reference plane is completely
determined by the given scene calibration data, then for each point in the
observed image we can compute the corresponding reference-image point
coordinates $p_{it} = H_t^{-1} q_{it}$. These quantities can then be used in
the constraints derived above. However, if the homography is known only up to a
2D affine transformation $G_t$ of $\Pi$, then we can use the affine normalized
coordinates $\hat{p}_{it} = G_t p_{it}$.

If we replace $p_{it}$ by $\hat{p}_{it}$ in all the derivations above,
everything remains true but with respect to a different 3D coordinate system:
the 3D point coordinates are now taken with respect to a 3D coordinate system
where the X-Y plane is transformed by the same affine transformation $G_t$. The
coordinates of point $P_i$ in this new coordinate system are
$[X_i', Y_i', aZ_i, W_i]$, where $X_i', Y_i'$ are obtained from $X_i, Y_i$ by
$G_t$, and $a$ is some scale factor. Thus when using image coordinates from
$\hat{p}_t$, we can still use all the expressions developed in Section 2,
taking care to replace $(X_i, Y_i)$ by $(X_i', Y_i')$.



3 Stratified Reconstruction 

In this section we show how the relations established in Section 2 between the
image positions of scene points on the reference plane can be used to recover
3D information about the scene with very little scene information. We will
describe an approach in which a gradual increase in domain information for
scene calibration leads to gradually increasing 3D information.

Registering the Input Images to the Reference Plane $\Pi$: As mentioned in
Section 2, our approach is based on registering each input image to a
pre-specified affine or Cartesian coordinate system on the reference plane. For
practical reasons, we do this in three stages: (i) determine the homography
$H_{st}$ between each image “s” in the sequence and an arbitrarily selected
reference image “t” from the same sequence, (ii) specify or infer the domain
information needed to register the reference image t to the reference plane
$\Pi$; based on this specification determine the 2D transformation $H_{t\pi}$
that aligns image t with $\Pi$, and (iii) concatenate $H_{st}$ and $H_{t\pi}$
to determine the transformation $H_{s\pi}$ that registers s with $\Pi$. We
refer to this process of registering the images to the reference plane as
“registering the reference plane”.
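Stage (iii) amounts to a product of two 3x3 matrices. The sketch below assumes the convention that each homography acts on column vectors, so that the image-to-plane map is the plane registration applied after the image-to-image alignment; the matrices `H_st` and `H_tpi` are hypothetical values used only to exercise the code.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def apply_homography(H, xy):
    """Map a 2D point through a 3x3 homography."""
    x, y = xy
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


# Hypothetical homographies, for illustration only.
H_st = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]   # image s -> image t
H_tpi = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.5, 0.0, 1.0]]   # image t -> plane

# Assumed column-vector convention: first H_st, then H_tpi.
H_spi = matmul3(H_tpi, H_st)

point_in_s = (1.0, 1.0)
direct = apply_homography(H_spi, point_in_s)
two_step = apply_homography(H_tpi, apply_homography(H_st, point_in_s))
```

Both routes give the same position on the reference plane, so only the reference image needs to be registered to the plane explicitly.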

Affine vs. Euclidean Normalization: As noted in Section 2.5, if the positions
of the image points on the reference plane $\Pi$ are known only up to a 2D
affine transformation of $\Pi$, it only affects the X and Y coordinates of the
3D scene points. The Z component is not affected by the unknown 2D affine
transformation $G_t$. Here the minimal scene calibration required for some 3D
inference should be sufficient for the registration of the reference plane up
to an affine transformation.

3D Reconstruction from the Dual Epipolar Geometry: (5) establishes the
relationship between reference-plane image positions of two points in one view
and the dual fundamental matrix $G$, which in turn depends on the coordinates
of the dual-epipole. Since this equation is homogeneous, it can be divided by
an arbitrary scale factor. In particular, we can divide (5) by $Z_i Z_j$ and
obtain a new form for $G$, which depends on the scaled homogeneous coordinates
of the dual-epipole

$$
p_{ij} \stackrel{\mathrm{def}}{=}
\left[ \frac{X_i}{Z_i} - \frac{X_j}{Z_j},\;
\frac{Y_i}{Z_i} - \frac{Y_j}{Z_j},\;
\frac{W_i}{Z_i} - \frac{W_j}{Z_j} \right]
\qquad (6)
$$

(5) provides one constraint on these 3 variables. Looking separately at the
pairs of 3D points $\{P_i, P_j\}$, $\{P_i, P_k\}$, $\{P_j, P_k\}$, we get from
each image 3 constraints on the 9 homogeneous coordinates of the three
dual-epipoles $p_{ij}$, $p_{jk}$ and $p_{ki}$. Thus, 2 images $t = 1, 2$ give 6
homogeneous constraints on these 9 variables. In addition, it can be easily
verified from (6) that $p_{ij} + p_{jk} + p_{ki} = 0$, which gives 3 more
constraints. Taken together these 9 homogeneous constraints are sufficient
to compute the coordinates of the three dual-epipoles up to a single scale
factor.

Note that each additional image provides 3 more homogeneous constraints
on the same 9 variables. Thus, if given more than 2 images, the computation
of the dual epipoles requires the solution of an over-determined linear system
of equations; such a system can be solved in a least-squares sense using
standard tools such as SVD.
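For exactly two views and noise-free data, the counting argument above can be sketched without SVD. Each dual epipole is found up to its own scale by intersecting lines, and the requirement that the three dual epipoles sum to zero then fixes their relative scales, leaving a single global factor. This is a simplified stand-in for the least-squares solution, with hypothetical coordinates and helper names.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def reference_image(P, C):
    """Homogeneous reference-plane image of P projected through focal point C."""
    X, Y, Z, W = P
    Xt, Yt, Zt, Wt = C
    return (-Zt * X + Xt * Z, -Zt * Y + Yt * Z, -Zt * W + Wt * Z)


def dual_epipole(pa1, pb1, pa2, pb2):
    """Dual epipole of points a, b (arbitrary scale) from their images in two views."""
    return cross(cross(pa1, pb1), cross(pa2, pb2))


def consistent_dual_epipoles(view1, view2):
    """Each view is (p_i, p_j, p_k). Returns (p_ij, p_jk, p_ki) rescaled to sum to zero."""
    (i1, j1, k1), (i2, j2, k2) = view1, view2
    e = [dual_epipole(i1, j1, i2, j2),
         dual_epipole(j1, k1, j2, k2),
         dual_epipole(k1, i1, k2, i2)]
    # Weights w with w0*e0 + w1*e1 + w2*e2 = 0: w is orthogonal to every row
    # of the 3x3 matrix whose columns are the three unscaled dual epipoles.
    rows = [tuple(v[r] for v in e) for r in range(3)]
    candidates = [cross(rows[0], rows[1]), cross(rows[0], rows[2]),
                  cross(rows[1], rows[2])]
    w = max(candidates, key=lambda v: sum(x * x for x in v))
    return [tuple(w[n] * x for x in e[n]) for n in range(3)]


# Hypothetical scene points and focal points, for illustration only.
points = [(1, 2, 1, 1), (3, -1, 2, 1), (0, 1, 3, 1)]
cameras = [(0, 0, 5, 1), (2, 1, 4, 1)]
views = [tuple(reference_image(P, C) for P in points) for C in cameras]

p_ij, p_jk, p_ki = consistent_dual_epipoles(views[0], views[1])
```

With more views or noisy measurements the same unknowns would instead be estimated from the stacked linear constraints, as the text describes.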

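Of particular interest is the third coordinate of the dual-epipoles. Assuming,
w.l.o.g., $W_i = 1$, the third coordinates of the three dual-epipoles are
$\frac{1}{Z_i} - \frac{1}{Z_j}$, $\frac{1}{Z_j} - \frac{1}{Z_k}$ and
$\frac{1}{Z_k} - \frac{1}{Z_i}$. Since we can compute these quantities only up
to a scale factor, we can only determine their ratios. We arbitrarily select
two out-of-plane points as “reference points” and denote them by indices 1 and
2. Then for any other point $i$ we can compute

$$
u_i = \frac{\frac{1}{Z_i} - \frac{1}{Z_1}}{\frac{1}{Z_2} - \frac{1}{Z_1}}
    = \frac{1 + \alpha}{\alpha} \left( 1 - \frac{Z_1}{Z_i} \right)
\qquad (7)
$$

where $\alpha = \frac{Z_2}{Z_1} - 1$.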

Stratified Reconstruction: (7) can be used to compute the height of points
relative to the reference plane $\Pi$. With increasing amount of scene
information we get increasing specificity:

Ordinal reconstruction: from (7) it follows that $u_i$ is monotonically related
   to the height $Z_i$. Hence, given only the knowledge whether $\alpha$ is
   positive or negative, the ordering of the height of all other points relative
   to $\Pi$ can be determined without any additional information. This type of
   3D information is useful in a number of visual reasoning tasks such as
   navigation and grasping, in order to determine potential obstructions or
   provide micro-management control commands.

Affine reconstruction: if the ratio $Z_2/Z_1$ is given, $\alpha$ can be
   determined, and the height of all other scene points relative to $\Pi$ can be
   determined up to an unknown scale factor.

Absolute depth: if, in addition, the height $Z_1$ is known, then the absolute
   height of all points relative to $\Pi$ can be determined.

Euclidean reconstruction: the remaining elements of the dual epipoles can
   be used to compute the X and Y coordinates of the object points. Given
   the heights $Z_1, Z_2$, the X, Y coordinates can be determined up to image
   translation. Given $(X_1, Y_1)$, $(X_i, Y_i)$ can be determined absolutely.
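The passage from ordinal to absolute height can be sketched as follows. The sketch assumes the ordinal value is the ratio of third dual-epipole coordinates normalized so that the two reference points receive the values 0 and 1, i.e. that it is an affine function of inverse height; the function names and the heights used are hypothetical.

```python
import math


def ordinal_from_heights(Z_i, Z_1, Z_2):
    """Ordinal value of a point at height Z_i, given reference heights Z_1 and Z_2."""
    if Z_i == 0:
        return math.inf  # points on the reference plane
    return (1.0 / Z_i - 1.0 / Z_1) / (1.0 / Z_2 - 1.0 / Z_1)


def height_from_ordinal(u_i, Z_1, Z_2):
    """Invert the relation above once both reference heights are known."""
    if math.isinf(u_i):
        return 0.0
    inverse_height = 1.0 / Z_1 + u_i * (1.0 / Z_2 - 1.0 / Z_1)
    return math.inf if inverse_height == 0 else 1.0 / inverse_height


# Hypothetical heights, for illustration only.
Z_1, Z_2 = 4.0, 2.0
heights = [1.0, 5.0, 8.0]
ordinals = [ordinal_from_heights(Z, Z_1, Z_2) for Z in heights]

# Ordinal stage: sorting by u alone recovers the height ordering
# (here in reverse, because larger u means smaller height).
order_by_u = sorted(range(len(heights)), key=lambda n: ordinals[n], reverse=True)

# Metric stage: with Z_1 and Z_2 known, the heights themselves are recovered.
recovered = [height_from_ordinal(u, Z_1, Z_2) for u in ordinals]
```

In this example all points lie on the same side of the plane; whether the ordinal values increase or decrease with height depends on the reference points, which is why one extra bit of information is needed at the ordinal stage.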

4 Experiments 

We compute the stratified reconstruction using five sequences of real images.
In the first three cases, corners were automatically extracted (see Fig. 1b) and
then automatically tracked over the sequence; for comparison we were given all
the 3D coordinates of the points. In the last two sequences, a reference plane
in the scene was stabilized first, then a dense parallax flow field was
recovered [?]. The reconstruction was then based on this parallax data. In this
case, the images were obtained using a hand-held video camera in a casual
manner, without making controlled measurements of the camera or scene
parameters.

4.1 Medium Depth Sequence 

An object (box) 15cm wide at about 60cm from the camera (Fig. 1a) was
photographed from 5 different points of view. First, we affine-normalized all
the images: using two identified pairs of parallel lines (e.g., the sides on one
of the faces of the box), we applied a 2D homography which made the two lines
parallel in the image and confined to the same image position in all the frames.
We then arbitrarily selected two out-of-plane points as reference points, and
computed the dual-epipole between them, as well as the dual-epipole between each
of the other data points and the two reference points. The dual-epipoles were
computed by using all the 5 frames and a least-squares solution to (5). Using
these dual-epipoles, we computed ordinal reconstruction as described in
Section 3. The results - each computed $u_i$ value vs. the real $u_i$ value -
are shown in Table 1.

Fig. 1. a) one frame from the medium depth sequence; b) corners extracted from a);
c) one frame from the large depth sequence; d) one frame from the lab sequence.


u_i est.   u_i act.  |  u_i est.   u_i act.  |  u_i est.   u_i act.  |  Z_i est.   Z_i act.
12.9521    12.6736*  |   0.5658     0.5729   |  -0.5205    -0.5166   |   0.8827     0.9
 8.5741     6.4453   |   0.3900     0.4712*  |  -0.4302    -0.5292   |   3.1798     3.2
 7.7989     6.2027   |   0.0666     0.3356   |  -0.5136    -0.5456*  |   5.0657     5.1
 2.5462     3.1829   |   0.1853     0.2679   |  -0.1878    -0.5576   |   6.5621     6.3
 2.4668     2.4414*  |   0.0527     0.2170   |  -0.5954    -0.5844   |   7.3215     7.5
 2.0456     2.3201   |   0.1875     0.1458*  |  -0.6176    -0.6407   |   8.7334     9.15
 2.2506     1.9965   |   0.0972     0.0096   |  -0.6576    -0.6473*  |  12.2156    12.6
 1.4493     1.9965   |   0.0341    -0.0188   |  -0.7028    -0.7083   |  14.1592    14
 0.9668     0.9498*  |  -0.1484    -0.1468   |  -0.7537    -0.7083   |
 0.9641     0.8550   |  -0.0954    -0.1622*  |  -0.8160    -0.7083   |
 0.6754     0.6091   |  -0.4950    -0.5036   |                       |

Table 1. The estimated vs. actual reconstructed 3D data, ordinal ($u_i$) and
Euclidean ($Z_i$), for the Medium Depth Sequence. For conciseness, the height
$Z_i$ is only shown for every fourth point, and the corresponding ordinal values
are marked by *.



Given the heights of the two reference points, we can continue further and
transform the ordinal reconstruction into Euclidean reconstruction by computing
the actual height $Z_i$ at each point. The results - some of the computed $Z_i$
vs. the actual $Z_i$ - are shown in Table 1.

Looking closely at these results, clearly the ordinal reconstruction by $u_i$ is
accurate at most of the points, with only a few points close in height switched
around. The heights are also quite accurate for most points, as the few
representative heights in Table 1 show.
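One simple way to quantify how well an ordering is preserved is to count the pairs of points whose estimated and actual ordinal values disagree in order. The sketch below is our own illustration of such a check, not the evaluation used in the paper; it is applied to the eight starred entries of Table 1 only.

```python
from itertools import combinations


def discordant_pairs(estimated, actual):
    """Count pairs of points ordered one way by `estimated` and the other by `actual`."""
    paired = list(zip(estimated, actual))
    return sum(1 for (e1, a1), (e2, a2) in combinations(paired, 2)
               if 0 > (e1 - e2) * (a1 - a2))


# The starred (every fourth) ordinal values from Table 1.
u_est = [12.9521, 2.4668, 0.9668, 0.3900, 0.1875, -0.0954, -0.5136, -0.6576]
u_act = [12.6736, 2.4414, 0.9498, 0.4712, 0.1458, -0.1622, -0.5456, -0.6473]

swaps = discordant_pairs(u_est, u_act)
```

For this subset the two orderings agree completely; the switched pairs mentioned in the text occur among points that are closer in height than these widely spaced samples.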



4.2 Large Depth Sequence 

Objects spanning 60cm at about 60cm from the camera (Fig. 1c) were
photographed from 5 different points of view. Unlike the previous sequence,
there is much larger depth of field in this sequence, and thus there are larger
projective distortions. Such distortions make metric reconstruction difficult.
Once again, we first affine-normalized all the images and followed a procedure
similar to the previous experiment to recover ordinal height ($u_i$) for 24
tracked points. The estimated vs. the actual 3D data are shown in Table 2.



u_i est.   u_i act.  |  u_i est.   u_i act.  |  u_i est.   u_i act.  |  Z_i est.   Z_i act.
  0.0077    0*       |   1.4512     1.3885*  |   3.8829     3.7841*  |  -20.4185   -20
 -0.1529   -0.3068   |   1.3658     1.3817   |   2.4482     2.5179   |   -0.2296     0
 -0.167    -0.3068   |   1.1322     1.1103   |   2.4848     2.402    |    2.138      2.2
 -2.4833   -3.079    |   1.1175     1.1103   |   2.6841     2.25     |    4.1373     4.4
-32.2960    ∞*       |   1.2088     1.0893*  |   2.1878     2.0795*  |    6.9691     7.4
 15.8281    ∞        |   1.072      1.0568   |   1.7263     1.6045   |    8.9946    10.5
 10.8487    8.2697   |   1.0901     1        |   1.6723     1.6045   |
  8.4828    6.3750   |   1.089      1        |   1.4574     1.4024   |

Table 2. Estimated vs. Actual 3D Data, ordinal ($u_i$) and Euclidean ($Z_i$),
for the Large Depth Sequence. For conciseness, the height $Z_i$ is only shown
for every fourth point, and the corresponding ordinal values are marked by *.

Given the heights of the 2 reference points, we can continue further and
transform the ordinal reconstruction into Euclidean reconstruction by computing
the actual height $Z_i$ at each point. The results - the computed $Z_i$ vs. the
actual $Z_i$ - are shown in Table 2.

Here, too, the ordinal reconstruction by $u_i$ is almost always accurate. Of the
30 features, only 3 pairs of neighbors in height were swapped with each other
(their height was 4 vs. 3.5, 11 vs. 12 and 10.2 vs. 10.5). The Z values are still
good, but less accurate when compared with the previous experiment.



4.3 Lab Sequence 

This sequence includes 16 images of a robotic laboratory, obtained by rotating
a robot arm 120° (one frame is shown in Fig. 1d). 32 corner-like points were
tracked. This sequence has the largest depth of field - the depth values of the
points in the first frame (relative to the camera) ranged from 13 to 33 feet.
Moreover, a wide-lens camera was used, causing distortions at the periphery
which were not compensated for. This is therefore the most difficult sequence so
far. Once again, following the same procedure as in the other two experiments,
we estimated ordinal and Euclidean reconstruction in 25 points. The estimated
$Z_i$ vs. actual $Z_i$ for these points are shown in Table 3.

Z_i est.  Z_i act. | Z_i est.  Z_i act. | Z_i est.  Z_i act. | Z_i est.  Z_i act. | Z_i est.  Z_i act.
 0.2045   -0.3386  |  9.4100    5.6942  | 15.3879   15.8765  |  9.2478   11.4074  |  8.3978   10.0444
-0.2048    0.3409  |  7.1877    6.4347  | 16.1396   16.5352  |  9.3495   12.1890  |  9.0416   10.0446
 2.2095    1.1965  |  9.4263    9.3823  | 17.4406   16.6093  |  9.2442   13.6701  |  9.1713   10.9708
 8.5974    1.5249  |  9.1724    9.4161  | 13.6005   16.6761  | 13.2664   13.9289  |  9.5846   10.9993
 8.1943    2.8690  |  9.9265    9.4161  | 16.5688   17.5328  | 16.5506   14.6616  |  9.9490   11.2723

Table 3. Estimated vs. Actual Heights for the Lab Sequence.

Note that although there are gross errors for some points, the computed
height in most points is fairly close to their true height.



4.4 Dense Height Maps 

The previous three examples gave quantitative illustrations of stratified
reconstruction at a sparse set of points in the scene. Here we include two
examples of dense reconstruction of ordinal heights. In both cases, the images
were acquired using a hand-held video camera. No quantitative information about
the scene structure, the camera imaging parameters, or its motion was available.

Given two input images, we used the method described in [?] to first estimate
the homography that aligns a dominant planar surface in the scene between the
two images. The residual parallax displacements between the points were then
computed using the method described in [?]. These displacements were used as
input for the stratified reconstruction of the scene.

The first example uses the “Toys” image sequence previously used in [?],
see Figure 2a. The scene consists of a few toys standing on a rug.

In order to register the images to the reference plane (up to a 2D affine
transformation), we used the following approach. Using commonly available image
manipulation tools on a PC, we drew two pairs of lines on the reference image
that were visually judged to be parallel to the “grooves” of the rug. These are
shown as grey lines painted on the carpet in Figure 2b. Note that the vertical
pair in particular is not parallel in the image itself, although it represents
parallel lines on the rug. We then interactively warped the images (using
homographies) until the lines appeared parallel in the image. Any of a family of
homographies that are equivalent up to a 2D affine transformation is sufficient
for this purpose. We arbitrarily picked one to achieve the intended effect. The
result of this process is shown in Figure 2c.
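The interactive warping described above has a well-known closed-form counterpart, sketched here as an alternative rather than as the procedure actually used: each pair of imaged parallel lines meets in a vanishing point, the two vanishing points span the imaged line at infinity, and a homography whose last row is that line makes both pairs parallel again. The function names and the synthetic test data are hypothetical.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def rectifying_homography(pair_a, pair_b):
    """Each pair is two lines, each line given by two image points.
    Returns a homography that restores parallelism (up to a 2D affine map)."""
    v1 = cross(line_through(*pair_a[0]), line_through(*pair_a[1]))  # vanishing point
    v2 = cross(line_through(*pair_b[0]), line_through(*pair_b[1]))  # vanishing point
    l = cross(v1, v2)                                               # vanishing line
    if 1e-12 > abs(l[2]):
        raise ValueError("vanishing line passes through the image origin")
    return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [l[0] / l[2], l[1] / l[2], 1.0]]


def warp(H, p):
    """Apply a 3x3 homography to an image point (x, y)."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Synthetic test: a unit square on the plane, distorted by a made-up projective map.
H_0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]
A, B, C, D = (warp(H_0, p) for p in [(0, 0), (1, 0), (0, 1), (1, 1)])

H = rectifying_homography(((A, B), (C, D)), ((A, C), (B, D)))
A2, B2, C2, D2 = (warp(H, p) for p in (A, B, C, D))
```

Any further 2D affine map composed with `H` would serve equally well, which matches the freedom noted in the text.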

We then computed the ordinal height $u_i$ (see (7)) using the parallax
displacements (also appropriately warped to reflect the projection onto the
reference plane). However, since $u_i = \infty$ for points on the reference
plane itself (i.e., when $Z_i = 0$ in (7)), we display $1/u_i$ in Figure 2d. As
evident from this image, the points on the rug are dark (corresponding to an
ordinal height of 0); also, a gradual increase in height along the two dominant
objects is noticeable. Since we did not have the actual heights of any of the
objects in the scene, we stopped at this stage of our stratified reconstruction.

Fig. 2. a) one frame from the toys sequence; b) same image as (a) but with lines drawn
on it (see text); c) the result of projecting (b) onto the reference plane (the rug); d)
ordinal height.

The second example uses a new sequence, which we will refer to as the “Doll”
sequence (see Figure 3a). In this case, the carpet and the floor constituted the
reference plane. The grid lines on the carpet and the floor served as the basis
for registering the images to the reference plane. As in the case of the “Toys”
example, we interactively warped the images until these lines appeared parallel
in the image, and arbitrarily picked one homography that achieved this effect.
Once again, we display $1/u_i$ in Figure 3c.

Note that in both these examples, even with the minimal calibration (two sets 
of parallel lines) we obtain a reconstruction that looks qualitatively consistent 
with the actual structure of the scene. As noted earlier, this would be useful in 
a number of visual reasoning tasks, such as navigation and grasping. 

4.5 Discussion 

The computation of height (a metric quantity) relative to the reference plane 
consistently gave good results while using noisy sequences with large perspective 
distortions. The algorithm is fast, robust and easy to implement since it only 
solves a linear system of equations. Height was computed with minimal scene 
Fig. 3. a) one frame from the doll sequence; b) the result of projecting (a) onto 
the reference plane (the rug); c) ordinal height. 



calibration: (i) ordinal reconstruction required knowledge of the reference 
frame up to an affine transformation - 2 degrees of freedom, and (ii) exact height 
required knowledge of the height of two points not on the reference plane - 2 
additional d.o.f. Thus we used 2-4 d.o.f. of scene calibration information, much 
less than required by other reconstruction algorithms which use scene calibration. 

For example, the scene calibration needed by the reconstruction algorithm 
described in HH includes the specification of the 3D coordinates of 5 reference 
points (supplied by an oracle) - 15 d.o.f. We used this method in [5] to accomplish 
reconstruction with the first two sequences described above. Although this 
computation relied on 15, rather than 4, pre-determined pieces of calibration 
data, and involved a complex non-linear algorithm, the reconstruction results in 
[5] are no better than our results here (e.g., a relative error of about 5%-10% 
per datapoint using the second sequence). 

5 Summary 

Since uncalibrated images only permit projective reconstruction, no metric in- 
formation (such as relative depth) can be deduced without some calibration, 
either camera calibration (external and internal parameters of the camera) or 
scene calibration (the 3D affine or Euclidean coordinates of some 3D landmarks). 
Camera calibration, however, is not always possible: it requires a partially fixed 
camera (unsuitable, e.g., if the images are taken by many cameras) or some con- 
trol over the camera motion (unsuitable, e.g., with video data). Scene calibration 
requires a priori knowledge of known 3D points, and is typically employed after 
the projective reconstruction; therefore it is typically ill-advised to use directly 
a least squares linear reconstruction algorithm as we do here, since there is no 
suitable least squares error in 3D projective space. 

^ The 2D affine transformation Gt has 6 degrees of freedom, whereas a general 2D 
projective transformation (homography) has 8 d.o.f.; thus 2D affine plane calibration 
requires the specification of 2 d.o.f. 
Our contribution in this paper is threefold: (1) We use partial scene calibration, 
as little as two parallel lines on a plane; this information need not be 
given a priori, and can be inferred from the images directly. (2) We perform 
the calibration prior to the reconstruction; this allows the use of a robust least 
squares structure computation from many frames. (3) We obtain a hierarchy 
of intermediate representations, from ordinal to Euclidean, which increasingly 
depend on the amount of scene calibration available. 
Abstract. Using Euclidean constraints to model large 3D environments 
is made possible. This has been a challenging issue for many years. Using 
such knowledge not only enlarges the number of feasible cases, but it also 
provides perfect results, unreachable formerly. We deal with a limited set 
of constraints composed of incidence relations, parallelism, and orthogo- 
nality. This knowledge is given manually, processed through a geometric 
reasoning system, and used during what we call a constraint bundle ad- 
justment. Results are very encouraging, even though the computational 
time may be prohibitive. 



Keywords: Geometric Reasoning, Self-Calibration, Euclidean Constraints. 

1 Introduction 

Modeling large 3D environments has been a major challenge for the 
past decades. Many approaches have been tried, using technologies from Photogrammetry, 
Laser-Range Metrology, and Computer Vision. Building 
a system that will automatically model a 3D environment under every circumstance 
is, for the time being, far from being realized. Many results have already 
been obtained. But, though for some applications they may seem 
quite acceptable, they are always computed up to sensor accuracy. To reach 
absolute accuracy, measurement must become certitude or knowledge. For instance, 
even if you measure 2 planes as being parallel, they are not parallel 
mathematically speaking until you constrain them to be so. So the idea of using 
knowledge of the environment both to extend the feasible cases and to reach 
absolute accuracy is very appealing. However, it is a very difficult task to use 
such information, and very little has been done on the subject. 

In Computer Vision, we use self-calibration techniques to estimate both the 
scene and the camera parameters. Those techniques have proved difficult because 
small perturbations in the 3D space induce bigger perturbations in the 
calibration. Thus, to cancel out 3D perturbations, we want to impose Euclidean 
constraints on a reconstruction and, accordingly, adjust the calibration. 
So we need a set of images with extracted features such as points, segments, lines, 
and planes. Those features have to be matched across the image sequence, and a 
calibration must be estimated. We then want to use Euclidean constraints such 
as: 



— incidence constraints, e.g. points belong to lines or planes; lines belong to 
planes 

— parallelism of lines or planes 

— orthogonality of lines or planes 

We teach our system such constraints manually. 

In the following we explain what we can do with such knowledge. First, we 
introduce Ritt-Wu’s method, a technique for automatically proving geometry 
theorems. Second, we show how this method can be used to find a minimal 
parametrization of the scene. Third, we introduce our constraint bundle adjustment. 
Lastly, results are shown for a real situation. 



2 Geometry and Symbolic Algebra Computation 

2.1 An Introduction to the Ritt-Wu’s Method 

Ritt-Wu’s method [13] is a technique to prove geometric theorems 
automatically. This method has already proved more than five hundred non-trivial 
theorems. A geometric theorem is composed of a set of hypotheses 
(geometric configuration) and a set of conclusions. The hypotheses are defined 
with a set of geometric primitives (e.g. points, lines, planes, ...) and a set of 
geometric constraints (e.g. incidence relations, parallelism, orthogonality, ...). A 
conclusion for such a theorem is a geometric constraint. To deal with geometric 
primitives, we must express the geometry in an algebraic notation. One possible 
way to do this is to choose a coordinate system of the space. 

In this case, the geometric primitives are defined by several parameters and 
the geometric properties by some polynomial equations in those parameters. A 
geometric configuration can thus be seen as a set of variables (the parameters 
of the 3D primitives) and a system of polynomial equations (the geometric constraints). 
Roughly speaking, a geometric configuration is an algebraic manifold. 

More precisely, a geometric configuration might also have degenerate conditions. 
For example, if we look at the median theorem: 

Let ABC be a triangle. Let B', C' be the midpoints of AC, AB respectively. 
Let G be the intersection of the lines BB', CC'. Then, the point 
of intersection of the lines AG, BC is the midpoint of BC. 

Obviously, if the points A, B, and C are collinear, the theorem is false. The 
equation det(A, B, C) = 0 represents a degenerate condition. The set of degenerate 
conditions also defines a manifold. Thus, a geometric configuration is 
represented by the manifold defined by the geometric properties, excluding the 
manifold defined by the degenerate conditions. 
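To make the role of the degenerate condition concrete, here is a small numerical check of the median theorem in exact rational arithmetic. It only tests one instance of the configuration and is not a proof. The helper names are assumptions of this sketch.

```python
from fractions import Fraction as F

def det3(a, b, c):
    """Twice the signed area of triangle abc; zero iff a, b, c are collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

def intersect(p1, p2, p3, p4):
    """Intersection of lines p1p2 and p3p4, or None if they are parallel."""
    d1 = (p2[0] - p1[0], p2[1] - p1[1])
    d2 = (p4[0] - p3[0], p4[1] - p3[1])
    den = d1[0] * d2[1] - d1[1] * d2[0]
    if den == 0:
        return None
    t = ((p3[0] - p1[0]) * d2[1] - (p3[1] - p1[1]) * d2[0]) / den
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

def median_theorem_holds(A, B, C):
    """True/False for a proper triangle, None on the degenerate manifold."""
    if det3(A, B, C) == 0:
        return None  # degenerate condition det(A, B, C) = 0
    B_mid, C_mid = midpoint(A, C), midpoint(A, B)
    G = intersect(B, B_mid, C, C_mid)
    M = intersect(A, G, B, C)
    return M == midpoint(B, C)

triangle = ((F(0), F(0)), (F(4), F(0)), (F(1), F(3)))
collinear = ((F(0), F(0)), (F(1), F(0)), (F(2), F(0)))
```

Calling `median_theorem_holds(*triangle)` confirms the conclusion on the manifold of hypotheses, while `median_theorem_holds(*collinear)` falls on the excluded degenerate manifold.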
Therefore, in order to prove a theorem, we must: 

— find the manifold of hypotheses H defined by the geometric constraints. 

— find the manifold of degenerate conditions D. 

— represent the conclusion of the theorem as a set of polynomial equations C. 
Then, the theorem is true if and only if the equations of C are satisfied on H 
excluding D. 

We can decompose the proof of a theorem into 2 problems. First, we must 
ensure that a polynomial of C vanishes on the manifold H. Second, we must find 
the manifold of degenerate conditions. 

Now, let’s focus on the first problem. We call PS a polynomial system defining 
the manifold H, and P a polynomial of C. In the particular case where all 
the polynomials of PS, and P, are linear, the solution is trivial. One way of 
solving this problem consists of solving the system PS (by a triangular form of 
PS) and substituting this result into the polynomial P. After substitution, P is 
zero if and only if P vanishes on the manifold H. In the general case, Ritt and 
Wu introduced: the notion of ascending chains (triangular forms of polynomial 
systems), the notion of pseudo-division (in the linear case, pseudo-division is 
substitution), and the notion of a characteristic set CS of PS. The interesting 
properties of CS are: 

— CS is an ascending chain (i.e. a triangular form of a polynomial system). 

— If P vanishes on H, the result of pseudo-division of P by CS is zero. 

— Conversely, if H is an irreducible manifold, and if the result of pseudo-division 
of P by CS is zero, then P vanishes on H. 

In other words, a characteristic set is a means of verifying that P vanishes on the 
manifold H. Ritt and Wu also gave an algorithm for computing a characteristic 
set automatically, which is beyond the scope of this article. 
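As a rough illustration of pseudo-division, the sketch below computes the pseudo-remainder of two polynomials in a single variable with integer coefficients. This is a simplification: in Ritt-Wu's method the division is carried out with respect to the leading variable of each polynomial of CS, and the coefficients are themselves polynomials. The function names are assumptions of this example.

```python
def trim(p):
    """Drop zero leading coefficients (lists store lowest degree first)."""
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p

def pseudo_remainder(f, g):
    """Pseudo-remainder r of f by g, with deg r < deg g.

    Satisfies lc(g)**k * f = q * g + r for some k and q, so the
    computation never introduces fractions.
    """
    r, g = trim(f), trim(g)
    if g == [0]:
        raise ValueError("division by the zero polynomial")
    lc_g = g[-1]
    while len(r) >= len(g) and r != [0]:
        shift = len(r) - len(g)
        lc_r = r[-1]
        r = [lc_g * c for c in r]      # multiply by the leading coefficient of g
        for i, c in enumerate(g):      # cancel the leading term of r
            r[i + shift] -= lc_r * c
        r = trim(r)
    return r

# x**2 - 1 vanishes wherever 2x - 2 does, so the pseudo-remainder is zero.
vanishing = pseudo_remainder([-1, 0, 1], [-2, 2])
# x**2 + 1 does not, and a non-zero remainder is left.
non_vanishing = pseudo_remainder([1, 0, 1], [-2, 2])
```

The multiplication by the leading coefficient of `g` is the step that makes leading coefficients appear in the degenerate conditions: the identity above says nothing when that coefficient vanishes.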

Now, let’s focus on the second problem. Let’s call the leading coefficient of a 
polynomial P, the polynomial coefficient associated to the highest degree of P 
considered as a polynomial in its highest variable. During the computation of a 
characteristic set, the leading coefficients of the polynomials of the characteristic 
set must sometimes be tested to be different from zero. Ritt and Wu showed that 
the leading coefficients, when vanishing, describe the degenerate conditions. 

We can now prove a theorem. Let T be a theorem. If the manifold defined by 
the hypotheses H of T is irreducible, then Ritt-Wu’s method can prove the 
theorem T. If H is reducible, then Ritt-Wu’s method is not sufficient, and 
we have to use the complete Wu’s method [14] (we do not describe this method in 
this paper). Fortunately, in almost every case, the manifold H is irreducible 
and Ritt-Wu’s method is sufficient. In the other cases, the coordinate system 
is not adapted to the geometric configuration, or at least one primitive of the 
geometric configuration is not completely determined. 

2.2 Constraints and Parametrization of a Geometric Model 

A geometric model is defined by a set of primitives (e.g. points, lines and planes) 
and a set of constraints among those primitives. Let M be a geometric model. 
Choosing a coordinate system allows us to give a parametrization of M as follows: 



Imposing Euclidean Constraints During Self-Calibration Processes 227 



— Each point is given three parameters (the coordinates). 

— Each line is given six parameters (two points). 

— Each plane is given three parameters (the coefficients of the equation, normalizing 
the constant term to 1, which is possible if the plane does not go 
through the origin of the coordinate system). 

The Ritt-Wu’s method does not restrict the choice of geometry (Projective, 
Affine or Euclidean) nor the choice of constraints. This technique works with 
anything which can be written as polynomial equations. In this paper, we use 
only incidence, parallel, and orthogonal constraints. With such a limited set of 
constraints, it is easy to compute the polynomial equations which represent a 
geometric constraint automatically. For example, we represent the orthogonality 
between 2 planes P(n, 1) and P'(n', 1) by the polynomial equation: 

$$n \cdot n' = 0$$

It can be seen that all our constraints are linear with respect to each parameter 
taken individually. For instance, $n_x n'_x + n_y n'_y + n_z n'_z = 0$ is linear in each of its 
parameters $\{n_x, n_y, n_z, n'_x, n'_y, n'_z\}$. 
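The three constraint types can be written as polynomial residuals in a few lines. The sketch below assumes the plane parametrization (n, 1) stands for the equation n · x + 1 = 0; that sign convention, and all function names, are assumptions of this example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def incidence(x, n):
    """Point x on plane (n, 1): residual of n . x + 1 = 0."""
    return dot(n, x) + 1

def orthogonality(n1, n2):
    """Planes (n1, 1) and (n2, 1) orthogonal: residual of n1 . n2 = 0."""
    return dot(n1, n2)

def parallelism(n1, n2):
    """Planes parallel: the three residuals of n1 x n2 = 0."""
    return cross(n1, n2)

def second_difference(f, a, b, c):
    """Zero for equally spaced a, b, c when f is linear (affine) in its argument."""
    return f(a) - 2 * f(b) + f(c)

# Vary only n_x of the first plane: the residual changes linearly.
linear_in_nx = second_difference(
    lambda nx: orthogonality((nx, 2, 3), (4, 5, 6)), 0, 1, 2)
```

Each residual is a polynomial of degree at most one in any single parameter, which is the property the text relies on.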

Now, for the model M, we have: a full parametrization of the primitives X, a 
set of polynomial equations PS representing the constraints, and H the manifold 
defined by PS. Remembering that a characteristic set CS of PS is a triangular 
form of a polynomial system, it is possible to separate the set of parameters 
into 2 parts. The first one, X_c, is the set of the leading parameters (the highest 
variable) in each polynomial in CS. The second one, X_m, is the set of the other 
parameters (Table 1). In the following, we denote by $\bar{Y}$ the set of numerical values of a 
set of parameters Y. 

$$
\begin{aligned}
f_1 &\in K[X_m, x_1] \\
f_2 &\in K[X_m, x_1, x_2] \\
&\;\;\vdots \\
f_n &\in K[X_m, x_1, x_2, \ldots, x_n]
\end{aligned}
\qquad \text{where } X_c = \{x_1, x_2, \ldots, x_n\}
$$

Table 1. A triangular polynomial system. 

The theory of characteristic sets states that: 

— The knowledge of Xm (the values of the parameters of Xm) allows us to 
infer X. More precisely, there exists an order on the parameters of Xc which 
allows us to evaluate any parameter from the knowledge of the parameters 
which come before it. Roughly speaking, we compute each parameter by a 
back-substitution over CS. 

— The knowledge of Xm implies a finite number of evaluations of X. This tells 
us that the number of parameters in Xm is minimal. 

^ this is always made possible by moving the origin. 
— The number of parameters of Xm is equal to the dimension of a nondegenerate 
component of H. 

When computing X, the back-substitution may lead us to solve polynomials 
with degree greater than 1, which obviously have several roots. Thus X may not 
be defined uniquely. 

In our case, PS is linear in each parameter. Moreover, it is possible to show 
that H has only one nondegenerate component. As a consequence, with 
such a limited set of constraints, we are able, on the one hand, to give a minimal 
parametrization of the model M and, on the other hand, to give a finite set of 
possible values for every parameter of M. 

3 Algorithm 

Defining a minimal parametrization is equivalent to defining a mapping from 
the 3D space to a set of values. The problem is to go the other way, i.e. from 
the set of minimal parameters, we need to infer the 3D location of all our 3D 
primitives. This inverse mapping is not straightforward as some equations may 
provide several solutions. Each time we compute a value of a parameter from 
an implicit equation, if the equation has different solutions, we need to choose 
the one that corresponds to a suitable 3D interpretation. More precisely, as we 
start from an initialization, at each step we have an estimation of any primitive 
location. Thus, we can choose during ambiguous cases, the closest solution to 
the current estimation. Doing so, we define the inverse mapping that produces 
from the minimal parametrization, the 3D location of all primitives. 
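The inverse mapping can be sketched as a back-substitution over a triangular system in which each polynomial is at most quadratic in its leading parameter. The toy chain below, and every name in it, is hypothetical; it only illustrates the rule of keeping the root closest to the current estimate.

```python
import math

def solve_leading(coeffs, estimate):
    """Solve a*x**2 + b*x + c = 0 (or b*x + c = 0 when a == 0) and
    return the real root closest to the current estimate."""
    a, b, c = coeffs
    if a == 0:
        return -c / b
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real solution for this parameter")
    s = math.sqrt(disc)
    roots = [(-b - s) / (2 * a), (-b + s) / (2 * a)]
    return min(roots, key=lambda r: abs(r - estimate))

def back_substitute(chain, x_m, estimates):
    """Evaluate the leading parameters one by one.

    chain[k](x_m, solved) returns the coefficients (a, b, c) of the k-th
    polynomial in its leading parameter, given the minimal parameters x_m
    and the leading parameters already solved.
    """
    solved = []
    for poly, estimate in zip(chain, estimates):
        solved.append(solve_leading(poly(x_m, solved), estimate))
    return solved

# Hypothetical chain: f1 = x1 - 2*m,  f2 = x2**2 - x1**2.
chain = [
    lambda x_m, solved: (0.0, 1.0, -2.0 * x_m[0]),
    lambda x_m, solved: (1.0, 0.0, -solved[0] ** 2),
]
near_positive = back_substitute(chain, (1.5,), [0.0, 2.7])
near_negative = back_substitute(chain, (1.5,), [0.0, -2.0])
```

The same minimal parameters give two valid configurations; the current estimate decides which one is kept.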

We can now adjust the parametrization from the images and thus define our 
constrained bundle adjustment. From a current minimal parametrization 
X_m, we can infer X, which encodes the 3D coordinates of all primitives P_i. Then, 
using the current calibration parameters X_cam, we define $\hat{p}_i^j$ as the projection 
in image j of the primitive P_i, i.e. $\hat{p}_i^j = \mathrm{proj}_j(P_i)$. We can now compare, 
using an image distance d, a primitive originally defined in image j, $p_i^j$, with its 
projected counterpart $\hat{p}_i^j$. Denoting by P_ind the set of indices of the 
primitives and by I_ind the set of indices of the images, we thus 
define a cost function: 

$$C(X_m, X_{cam}) = \sum_{i \in P_{ind},\; j \in I_{ind}} d\left(p_i^j, \hat{p}_i^j\right) \qquad (1)$$

With such a cost function we can define a constraint bundle adjustment. 
In order to get a calibration and a constrained reconstruction, we minimize 
the cost function C over the camera parameters and the minimal parametrization 
of the scene: 

$$\min_{X_m,\; X_{cam}} C(X_m, X_{cam})$$

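The cost function can be sketched as follows for point primitives. The camera model here is a deliberately simple, hypothetical pinhole without rotation, and the 3D points are passed in directly, whereas in the method described above they would be obtained from the minimal parametrization by the inverse mapping.

```python
import math

def project(camera, point):
    """Hypothetical pinhole camera (f, tx, ty, tz) with no rotation."""
    f, tx, ty, tz = camera
    x, y, z = point[0] - tx, point[1] - ty, point[2] - tz
    return (f * x / z, f * y / z)

def cost(points, cameras, observations):
    """Sum over observed (i, j) pairs of the image distance between the
    measured primitive and the projection of primitive i in image j."""
    total = 0.0
    for (i, j), observed in observations.items():
        predicted = project(cameras[j], points[i])
        total += math.dist(observed, predicted)  # Euclidean image distance
    return total

cameras = [(100.0, 0.0, 0.0, 0.0), (100.0, 1.0, 0.0, 0.0)]
points = [(0.0, 0.0, 5.0), (1.0, 2.0, 10.0)]
observations = {(i, j): project(cameras[j], points[i])
                for i in range(2) for j in range(2)}
```

A minimizer would repeatedly update the minimal parameters and the camera parameters, rebuild the points, and re-evaluate `cost`; that loop is omitted here.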
4 Experiment 

We start from a reconstruction estimated by TotalCalib. This reconstruction 
(Fig. 1) comes from a self-calibration process, i.e. in particular, we do not 
have any knowledge of the scene. Meanwhile, from our understanding of the 




Fig. 1. ’La place des Arcades’ 



scene, we think that some Euclidean constraints have to be respected. In fact, 
the scene being made of buildings, some planes of the scene are parallel, others 
are orthogonal, and some points belong to some planes. Here is an orthographic 
top view (Fig. 2) of the estimated structure of the scene without imposing any 
knowledge. 

This particular scene can be structured with 8 planes (A,B,C,D,E,F,G,H), 
where A is defined with 4 points (22,23,32,35), B with 4 points (25,26,31,33), ..., 
up to H which is defined with 5 points (22,34,36,37,46). We want to impose: first, 
that all points belong to their respective planes; second, that A, B, and C are 
rectangles; third, that the following Euclidean constraints are respected: 

$$
\begin{cases}
A \parallel B \parallel C \\
D \parallel E \\
F \parallel G \parallel H \\
A \perp G \\
D \perp H
\end{cases}
$$

Tables 2 and 3 summarize the values of the Euclidean constraints after the self-calibration 
process. In particular, Table 2 shows the distances (as a percentage of 
the size of the scene) of all points to the planes to which they belong. Table 3 
shows the angles between normals of planes, i.e. the angle should be 0.0 for 
parallel planes and 90.0 for orthogonal planes. 
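The two diagnostics reported in these tables can be computed as sketched below. The plane representation n · x + d = 0 and the function names are assumptions of this example, not the authors' code.

```python
import math

def angle_between_normals(n1, n2):
    """Angle in degrees between two plane normals, ignoring their sign."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.hypot(*n1) * math.hypot(*n2)
    cosine = min(1.0, abs(dot) / norm)  # clamp against rounding above 1
    return math.degrees(math.acos(cosine))

def point_plane_distance(x, n, d):
    """Distance from point x to the plane n . x + d = 0."""
    return abs(sum(a * b for a, b in zip(n, x)) + d) / math.hypot(*n)

def as_percentage(distance, scene_size):
    """Distance expressed as a percentage of the size of the scene."""
    return 100.0 * distance / scene_size
```

For example, `angle_between_normals((0, 0, 1), (0, 0, -2))` is 0.0 for two parallel planes, and a value close to 90.0 is expected for orthogonal ones.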
Fig. 2. ’La place des Arcades’ seen from above 



This scene, being made of 22 points, has 66 (3*22) parameters or degrees 
of freedom. Imposing the Euclidean constraints makes the number of param- 
eters come down to 41. Now, running a constraint bundle adjustment on the 
calibration parameters and the 41 parameters of the scene gives us a reconstruction 
(Figs. 3 and 4) with Euclidean constraints perfectly satisfied! The final residual 
after the second bundle adjustment is 1.21pixel, versus O.Qlpixel for the first 
one. As the 2D points are defined up to pixel accuracy above images, the small 
residual values mean that we gain the major improvement of imposing 3D con- 
straints without perturbing too much the calibration. 

5 Conclusion 

Using Euclidean constraints, generally speaking, is a very difficult problem. Here, 
with a limited set of constraints and the use of a powerful algebraic theory, we can 
obtain perfect modeling results. Unfortunately, extending the set of constraints, 
to known distances for instance, may be very difficult because it requires solving 
polynomial systems of much higher degrees. Theoretically, when using more 
constraint types, the existence of a minimal parametrization is no longer 
guaranteed. Moreover, though the theory of the characteristic set is a powerful 
tool, using it requires a lot of resources, which, for the time being, become 
prohibitive for larger environments. 






6 Appendix 

We illustrate the Ritt-Wu’s method with the following example. 

Simson’s theorem: Let ABC be a triangle and consider its circumscribed circle. Let D be a 
point on this circle. Let E, F, and G be the images of D under the respective perpendicular 
projections onto the sides BC, CA, and AB. Then E, F, and G are collinear. 







Proof: 

We give coordinates to the points A = (0, 0), B = (u_1, 0), C = (u_2, u_3), 
O = (x_2, x_1), D = (x_3, u_4), E = (x_5, x_4), F = (x_7, x_6) and G = (x_3, 0). 

We represent the hypotheses by a set of polynomial equations PS: 

$$
\begin{aligned}
OA = OC &\iff h_1 = 2u_2x_2 + 2u_3x_1 - u_3^2 - u_2^2 = 0 \\
OA = OB &\iff h_2 = 2u_1x_2 - u_1^2 = 0 \\
OA = OD &\iff h_3 = -x_3^2 + 2x_2x_3 + 2u_4x_1 - u_4^2 = 0 \\
(DF), (AC) \text{ perpendicular} &\iff h_4 = u_2x_7 + u_3x_6 - u_2x_3 - u_3u_4 = 0 \\
(DE), (BC) \text{ perpendicular} &\iff h_5 = (u_2 - u_1)x_5 + u_3x_4 + (u_1 - u_2)x_3 - u_3u_4 = 0 \\
F, A, C \text{ collinear} &\iff h_6 = u_3x_7 - u_2x_6 = 0 \\
E, B, C \text{ collinear} &\iff h_7 = u_3x_5 + (-u_2 + u_1)x_4 - u_1u_3 = 0
\end{aligned}
$$

Now, we compute a characteristic set CS of PS by Ritt-Wu’s method: 

$$
\begin{aligned}
f_1 &= -2(u_1u_3)x_1 + u_1(u_3^2 + u_2^2 - u_1u_2) \\
f_2 &= 2(u_1)x_2 - u_1^2 \\
f_3 &= -(u_1u_3)x_3^2 + u_1^2u_3x_3 + u_1u_4(u_3^2 + u_2^2 - u_1u_2 - u_3u_4) \\
f_4 &= -((u_1 - u_2)^2 + u_3^2)x_4 + (u_2 - u_1)u_3x_3 + u_3(u_1^2 - u_1u_2 + u_3u_4) \\
f_5 &= u_3((u_1 - u_2)^2 + u_3^2)x_5 - u_3(u_1 - u_2)^2x_3 + u_1u_3^2u_4 - u_1u_3^3 - u_2u_3^2u_4 \\
f_6 &= (u_2^2 + u_3^2)x_6 - u_3u_2x_3 - u_3^2u_4 \\
f_7 &= u_3(u_2^2 + u_3^2)x_7 - u_2^2u_3x_3 - u_2u_3^2u_4
\end{aligned}
$$
Table 4. The distribution of the variables in the polynomial equations PS. 




Table 5. The distribution of the variables in the polynomial equations CS. 



The conclusion, E, F, and G are collinear, is represented by: 

$$g = x_4x_7 + (x_3 - x_5)x_6 - x_3x_4 = 0.$$

Now, as the result of pseudo-division of g by CS is zero, Simson’s theorem 
is true. Moreover, it also provides us with the following degenerate conditions: 

$$
\begin{array}{ll}
u_1u_3 = 0 & A = B \text{ or } A, B \text{ and } C \text{ are collinear} \\
u_1 = 0 & A = B \\
(u_1 - u_2)^2 + u_3^2 = 0 & B = C \\
u_3((u_1 - u_2)^2 + u_3^2) = 0 & A, B \text{ and } C \text{ are collinear or } B = C \\
u_3^2 + u_2^2 = 0 & A = C \\
u_3(u_2^2 + u_3^2) = 0 & A = C \text{ or } A, B \text{ and } C \text{ are collinear}
\end{array}
$$
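As a sanity check on the coordinates used in this appendix, the sketch below evaluates the conclusion polynomial g on one concrete instance in exact rational arithmetic. It verifies a single configuration and is no substitute for the pseudo-division argument. The triangle, the point D and the helper names are choices made for this example.

```python
from fractions import Fraction as F

def foot(d, p, q):
    """Perpendicular projection of point d onto the line through p and q."""
    vx, vy = q[0] - p[0], q[1] - p[1]
    t = ((d[0] - p[0]) * vx + (d[1] - p[1]) * vy) / (vx * vx + vy * vy)
    return (p[0] + t * vx, p[1] + t * vy)

def conclusion(A, B, C, D):
    """Value of g = x4*x7 + (x3 - x5)*x6 - x3*x4 with E = (x5, x4),
    F = (x7, x6) and G = (x3, 0), for A at the origin and B on the x-axis."""
    x5, x4 = foot(D, B, C)
    x7, x6 = foot(D, C, A)
    x3, _ = foot(D, A, B)
    return x4 * x7 + (x3 - x5) * x6 - x3 * x4

A, B, C = (F(0), F(0)), (F(2), F(0)), (F(0), F(2))
# The circumscribed circle has centre (1, 1) and squared radius 2.
on_circle = (F(12, 5), F(6, 5))     # (7/5)**2 + (1/5)**2 == 2
off_circle = (F(1), F(1))           # the centre itself
```

`conclusion(A, B, C, on_circle)` evaluates to zero, while the point off the circle violates hypothesis h_3 and gives a non-zero value.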
Abstract. We present some recent progress in designing and implementing 
two interactive image-based 3D modeling systems. 

The first system constructs 3D models from a collection of panoramic 
image mosaics. A panoramic mosaic consists of a set of images taken 
around the same viewpoint, and a camera matrix associated with each 
input image. The user first interactively specifies features such as points, 
lines, and planes. Our system recovers the camera pose for each mosaic 
from known line directions and reference points. It then constructs the 
3D model using all available geometrical constraints. 

The second system extracts structure from stereo by representing the 
scene as a collection of approximately planar layers. The user first in- 
teractively segments the images into corresponding planar regions. Our 
system recovers a composite mosaic for each layer, estimates the plane 
equation for the layer, and optionally recovers the camera locations as 
well as out-of-plane displacements. 

By taking advantage of known scene regularities, our interactive systems 
avoid difficult feature correspondence problems that occur in traditional 
automatic modeling systems. They also shift the interactive high-level 
structural model specification stage to precede (or intermix with) the 
3D geometry recovery. They are thus able to extract accurate wire frame 
and texture-mapped 3D models from multiple image sequences. 



1 Introduction 

A lot of progress has been made recently in developing automated techniques 
for 3D scene reconstruction from multiple images, both with calibrated and 
uncalibrated cameras. Unfortunately, the results 
from many automated modeling systems are disappointing due to the complexity 
of real scenes and the fragility of fully automated vision techniques. Part of 
the reason stems from the accurate and robust correspondences required by 
many computer vision techniques such as stereo and structure from motion. 
Moreover, such correspondences may not be available in regions of the scene 
that are untextured. 
Automated techniques often require manual clean-up and post-processing to 
segment the scene into coherent objects and surfaces, or to triangulate sparse 
point matches [2]. They may also be required to enforce geometric constraints 
such as known orientations of surfaces. For instance, building interiors and exte- 
riors provide vertical and horizontal lines and parallel and perpendicular planes. 
In this paper, we attack the 3D modeling problem from the other side: we spec- 
ify some geometric knowledge ahead of time (e.g., known orientations of lines, 
co-planarity of points, initial scene segmentations), and use these constraints to 
guide our matching and reconstruction algorithms. 

The idea of using geometric constraints has previously been exploited in several 
interactive modeling systems. For example, PhotoModeler is a commercial 
product which constructs 3D models from several images, using photogrammetry 
techniques and manually specified points. The TotalCalib system, on the 
other hand, estimates the fundamental matrix from a few hand-matched points, 
and then predicts and verifies other potential matching points. The Facade 
system exploits the known rectahedral structure of building exteriors to directly 
recover solid 3D models (blocks) from multiple images. 

This paper presents two interactive (semi-automated) systems for recovering 
3D models of large-scale environments from multiple images. Our first system 
uses one or more panoramic image mosaics, i.e., collections of images taken from 
the same viewpoint that have been registered together. Panoramas offer several 
advantages over regular images. First, we can decouple the modeling problem 
into a zero baseline problem (building panoramas from images taken with 
a rotating camera) and a wide baseline stereo or structure from motion problem 
(recovering a 3D model from one or more panoramas). Second, the intrinsic camera 
calibrations are recovered as part of the panorama construction. Due to 
recent advances, it is now possible to construct panoramas even with hand-held 
cameras. 

Unlike previous work on 3D reconstruction from multiple panoramas, 
our 3D modeling system exploits important regularities present in the environment, 
such as walls with known orientations. Fortunately, the man-made world 
is full of constraints such as parallel lines, lines with known directions, and planes 
with lines and points on them. Using these constraints, we can construct a fairly 
complex 3D model from a single panorama (or even a wide-angle photograph), 
and easily handle large co-planar untextured regions such as walls. Using multiple 
panoramas, more complete and accurate 3D models can be constructed. 

Multiple-image stereo matching can be used to recover a more detailed description 
of surface shape than can be obtained by simply triangulating matched 
feature points. Unfortunately, stereo fails in regions without texture. A simple 
depth map also cannot capture the full complexity of a large-scale environment. 
Methods for overcoming this limitation include volumetric stereo techniques 
and model-based stereo. 
In this paper, we propose a different approach: extending the concept of 
layered motion estimates and “shallow” objects to true multi-image 
stereo matching. Our second interactive modeling system reconstructs the 3D 
scene as a collection of approximately planar layers, each of which has an explicit 
3D plane equation, a color image with per-pixel opacity, and optionally a per-pixel 
out-of-plane displacement. This representation allows us to account for 
inter-surface occlusions, which traditional stereo systems have trouble modeling 
correctly. 



2 3D Modeling from Panoramas 

2.1 Interactive Modeling System 

Our modeling system uses one or more panoramas. For each panorama, we draw 
points, lines, and planes, set appropriate properties for them, and then recover 
the 3D model. These steps can be repeated in any order to refine or modify 
the model. The modeling system attempts to satisfy all possible constraints in 
a consistent and coherent way. 

Three coordinate systems are used in our work. The first is the world coor- 
dinate system where the 3D model geometry is defined. The second is the “2D” 
camera coordinate system (panorama coordinates). The third is the screen co- 
ordinate system where zoom and rotation (pan and tilt, but no roll) are applied 
to facilitate user interaction. While each panorama has a single 2D coordinate 
system, several views of a given panorama can be open simultaneously, each with 
its own screen coordinate system. 

We represent the 3D model by a set of points, lines and planes. Each point is 
represented by its 3D coordinate x. Each line is represented by its line direction 
m and points on the line. Each plane is defined by (n, d) where n is the normal, 
d is the distance to the origin, and n · x + d = 0 or (n, d) · (x, 1) = 0. A plane 
typically includes several vertices and lines. 

Each 2D model consists of a set of 2D points and lines extracted from a 
panorama. A panorama consists of a collection of images and their associated 
transformations. A 2D point $\tilde{x}$ (i.e., on a panorama) represents a ray going 
through the 2D model origin (i.e., camera optical center). Likewise, a 2D line 
(represented by its line direction $\tilde{m}$) lies on the “line projection plane” (with 
normal $\tilde{n}_p$) which passes through the line and the 2D model origin. 

2.2 Modeling Steps 

Many constraints exist in real scenes. For example, we may have known quanti- 
ties like points, lines, and planes. Or we may have known relationships such as 
parallel and vertical lines and planes, points on a line or a plane. With multi- 
ple panoramas, we have more constraints from corresponding points, lines, and 
planes. 

^ We use the notation $\tilde{x}$ for a 2D point, $x$ for a 3D point, and $\hat{x}$ for a 3D point whose 
position is known. Likewise for line directions, plane normals, etc. 

^ If a pixel has the screen coordinate (u, v, 1), its 2D point on the panorama is represented 
by (u, v, f) where f is the focal length. 
Some of these constraints are bilinear. For example, a point on a plane is a bilinear 
constraint in both the point location and the plane normal. However, plane 
normals and line directions can be recovered without knowing plane distance and 
points. Thus, in our system we decouple the modeling process into several lin- 
ear steps: (a) recovering camera orientations (R) from known line directions; 
(b) recovering camera translations (t) from known points; (c) estimating plane 
normals (n) and line directions (m); (d) estimating plane distances (d), vertex 
positions (x). These steps are explained in detail in the next sections. 

2.3 Recovering Camera Pose 

The camera poses describe the relationship between the 2D models (panorama 
coordinate systems) and the 3D model (world coordinate system). 

To recover the camera rotation, we use lines with known directions. For 
example, one can easily draw several vertical lines at the intersections of walls 
and mark them to be parallel to the Z axis of the world coordinate system. 
Given at least two vertical lines and a horizontal line, or two sets of parallel lines 
of known directions, the camera matrix can be recovered. This is achieved by 
computing vanishing points for the parallel lines, and using these to constrain 
the rotation matrix. If more than 2 vanishing points are available, a least squares 
solution can be found for R. 

To recover the translation, observe that a point on a 2D model (panorama) 
represents a ray from the camera origin through the pixel on the image, 

$$(\mathbf{x} - \mathbf{t}) \times \mathbf{R}^T \tilde{\mathbf{x}} = 0. \qquad (1)$$

This is equivalent to 

$$(\mathbf{x} - \mathbf{t}) \cdot (\mathbf{R}^T \tilde{\mathbf{p}}_j) = 0, \quad j = 0, 1, 2, \qquad (2)$$

where $\tilde{\mathbf{p}}_0 = (-x_2, x_1, 0)$, $\tilde{\mathbf{p}}_1 = (-x_3, 0, x_1)$ and $\tilde{\mathbf{p}}_2 = (0, -x_3, x_2)$ are three 
directions perpendicular to the ray $\tilde{\mathbf{x}} = (x_1, x_2, x_3)$. Note that only two of the 
three constraints are linearly independent.^3 Thus, camera translation $\mathbf{t}$ can be 
recovered as a linear least-squares problem if we have two or more given points. 
Given a single known point, $\mathbf{t}$ can be recovered only up to a scale. In practice, it 
is convenient to fix a few points in the 3D model, such as the origin (0, 0, 0). These 
given points are also used to eliminate the ambiguities in recovering camera pose. 
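
The linear system behind Equation (1) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it stacks all three rows of the cross-product constraint for each point (rather than dropping the weakest one) and lets the least-squares solver handle the redundancy. The names `skew` and `recover_translation` are hypothetical.

```python
import numpy as np

def skew(d):
    """Cross-product matrix: skew(d) @ v == np.cross(d, v)."""
    return np.array([[0.0, -d[2], d[1]],
                     [d[2], 0.0, -d[0]],
                     [-d[1], d[0], 0.0]])

def recover_translation(R, known_points, rays_2d):
    """Solve (x_i - t) x (R^T x~_i) = 0 for t in the least-squares sense.

    known_points: 3D points with known world positions.
    rays_2d: the matching 2D panorama points (ray directions, camera frame).
    """
    rows, rhs = [], []
    for x, x2d in zip(known_points, rays_2d):
        d = R.T @ np.asarray(x2d, dtype=float)   # ray direction in the world frame
        S = skew(d / np.linalg.norm(d))
        rows.append(S)                            # S t = S x
        rhs.append(S @ np.asarray(x, dtype=float))
    t, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return t
```

Each point contributes a rank-2 block, so two points with non-parallel rays are enough to fix all three components of `t`.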

For a single panorama, the translation $\mathbf{t}$ is set to zero if no point in the 3D model 
is given. This implies that the camera origin coincides with the origin of the 3D model 
coordinate system. 

2.4 Estimating Plane Normals 

Once we have camera pose, we can recover the scene geometry. Because of the 
bilinear nature of some constraints (such as points on planes), we recover plane 
normals ($\mathbf{n}$) before solving for plane distances ($d$) and points ($\mathbf{x}$). If a normal 
is given (north, south, up, down, etc.), it can be enforced as a hard constraint. 
Otherwise, we compute the plane normal $\mathbf{n}$ by finding two line directions on the 
plane. 

^3 The third constraint with minimum $\|\tilde{\mathbf{p}}_i\|^2$ is eliminated. 

If we draw two pairs of parallel lines (a parallelogram) on a plane, we can 
recover the plane normal. Because $\mathbf{R}$ has been estimated, and we know how to 
compute a line direction (i.e., the vanishing point $\tilde{\mathbf{m}}$) from two parallel lines, we 
obtain $\mathbf{m} = \mathbf{R}^T \tilde{\mathbf{m}}$. From two line directions $\mathbf{m}_1$ and $\mathbf{m}_2$ on a plane, the plane 
normal can be computed as $\mathbf{n} = \mathbf{m}_1 \times \mathbf{m}_2$. 

In general, the line direction recovery problem can be formulated as a 
standard minimum eigenvector problem. Because each “line projection plane” is 
perpendicular to the line (i.e., $\tilde{\mathbf{n}}_{p_i} \cdot \tilde{\mathbf{m}} = 0$), we want to minimize 

$$e = \sum_i (\tilde{\mathbf{n}}_{p_i} \cdot \tilde{\mathbf{m}})^2 = \tilde{\mathbf{m}}^T \Big( \sum_i \tilde{\mathbf{n}}_{p_i} \tilde{\mathbf{n}}_{p_i}^T \Big) \tilde{\mathbf{m}}. \qquad (3)$$

This is equivalent to finding the vanishing point of the lines [2]. The advantage 
of the above formulation is that the sign ambiguity of $\tilde{\mathbf{n}}_{p_i}$ can be ignored. When 
only two parallel lines are given, the solution is simply the cross product of two 
line projection plane normals. 
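
The minimum-eigenvector formulation of Equation (3) and the cross-product rule for plane normals can be sketched as below. This is an illustrative sketch: the helper names are hypothetical, each drawn line is assumed to be given by two ray directions in the camera frame, and the recovered direction is only defined up to sign.

```python
import numpy as np

def projection_plane_normal(ray_a, ray_b):
    """Normal of the plane through the camera origin and a drawn 2D line,
    given two ray directions (camera frame) that lie on that line."""
    return np.cross(np.asarray(ray_a, dtype=float), np.asarray(ray_b, dtype=float))

def line_direction(plane_normals, R):
    """World-frame direction m = R^T m~ shared by a set of parallel lines.

    m~ minimises sum_i (n_i . m~)^2, i.e. it is the eigenvector of
    sum_i n_i n_i^T with the smallest eigenvalue. Its sign is arbitrary.
    """
    N = np.asarray(plane_normals, dtype=float)
    N = N / np.linalg.norm(N, axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(N.T @ N)   # eigenvalues in ascending order
    m_cam = eigvecs[:, 0]
    return R.T @ m_cam

def plane_normal(m1, m2):
    """Unit plane normal from two non-parallel line directions on the plane."""
    n = np.cross(m1, m2)
    return n / np.linalg.norm(n)
```

Because only outer products of the normals enter the matrix, flipping the sign of any `n_i` leaves the result unchanged, which is the advantage noted in the text.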

Using the techniques described above, we can therefore recover the surface 
orientation of an arbitrary plane (e.g., a tilted ceiling) provided we can draw 
either a parallelogram or a 3-sided rectangle on the plane. 

2.5 Estimating the 3D Model 

Given camera pose, line directions, and plane normals, recovering plane 
distances ($d$), 3D points ($\mathbf{x}$), and camera translation $\mathbf{t}$ (if desired) can be 
formulated as a linear system consisting of all possible constraints. By separating 
hard constraints from soft ones, we obtain a least-squares system with 
equality constraints. Intuitively, the difference between soft and hard constraints is 
their weights in the least-squares formulation. Soft constraints have unit weights, 
while hard constraints have very large weights [?]. 

Some constraints (e.g., a point is known) are inherently hard, and are therefore 
equality constraints. Some constraints (e.g., a feature location on a 2D model or 
panorama) are most appropriate as soft constraints because they are based on 
noisy image measurements. Take a point on a plane as an example. If the plane 
normal $\hat{\mathbf{n}}_k$ is given, we consider the constraint ($\mathbf{x}_i \cdot \hat{\mathbf{n}}_k + d_k = 0$) as hard. (We 
use the notations $\hat{\mathbf{m}}$ and $\hat{\mathbf{n}}$ to represent a given line direction $\mathbf{m}$ and plane 
normal $\mathbf{n}$, respectively.) This implies that the point has to be on the plane; only 
its location can be adjusted. On the other hand, if the plane normal is 
estimated, we consider the constraint ($\mathbf{x}_i \cdot \mathbf{n}_k + d_k = 0$) as soft. This could lead to an 
estimated point that is not on the plane at all. So why not make the constraint 
($\mathbf{x}_i \cdot \mathbf{n}_k + d_k = 0$) hard as well? 

The reason is that we may end up with a very bad model if some of the 
estimated normals have large errors. Too many hard constraints could conflict with 
one another or make other soft constraints insignificant. To satisfy all possible 
constraints, we formulate our modeling process as an equality-constrained 
least-squares problem. In other words, we would like to solve the linear system (soft 
constraints) $A\mathbf{x} = \mathbf{b}$ subject to (hard constraints) $C\mathbf{x} = \mathbf{q}$, where $A$ is $m \times n$ and 
$C$ is $p \times n$. A solution to the above problem is to use the QR factorization [?]. 

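
As an illustration of how a QR factorization can be used here, the sketch below applies the textbook null-space method: factor $C^T$, satisfy the hard constraints exactly with one part of the solution, and fit the soft constraints with the remaining degrees of freedom. It assumes $C$ has full row rank and $p < n$, and the function name `constrained_lstsq` is hypothetical; the paper does not spell out its exact solver.

```python
import numpy as np

def constrained_lstsq(A, b, C, q):
    """Minimise ||A x - b|| subject to C x = q (null-space method via QR).

    Assumes C (p x n) has full row rank and p < n.
    """
    A, b, C, q = (np.asarray(v, dtype=float) for v in (A, b, C, q))
    p = C.shape[0]
    Q, Rfull = np.linalg.qr(C.T, mode="complete")   # C^T = Q [R1; 0]
    R1 = Rfull[:p, :]                               # p x p, upper triangular
    Q1, Q2 = Q[:, :p], Q[:, p:]
    # Write x = Q1 y1 + Q2 y2. Then C x = R1^T y1, so y1 is fixed by q.
    y1 = np.linalg.solve(R1.T, q)
    # The remaining freedom y2 is used to fit the soft constraints.
    y2, *_ = np.linalg.lstsq(A @ Q2, b - A @ (Q1 @ y1), rcond=None)
    return Q1 @ y1 + Q2 @ y2
```

Unlike the large-weight formulation mentioned earlier, this variant satisfies the hard constraints to machine precision and avoids the conditioning problems that very large weights can cause.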
Before we can apply the equality-constrained linear system solver, we must 
check whether the linear system formed by all constraints is solvable. In general, 
the system may consist of several subsystems (connected components) which can 
be solved independently. For example, when modeling a room with a computer 
monitor floating in space, not connected to any wall, ceiling, or floor, we 
may have a system with two connected components. To find all connected 
components, we use depth-first search to step through the linear system. For each 
connected component we check that: (a) the number of equations (including 
both hard and soft constraints) is no fewer than the number of unknowns; (b) 
the right-hand side is a non-zero vector, i.e., has some minimal ground truth 
data; (c) the hard constraints are consistent. If any of the above is not satisfied, 
the system is declared unsolvable, and a warning message is generated to 
indicate which set of unknowns cannot be recovered. 
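
The component search and check (a) can be sketched as follows. This is a simplified illustration: each constraint is represented only by the set of unknowns it touches and is counted as one scalar equation, checks (b) and (c) are omitted, and all names are hypothetical.

```python
from collections import defaultdict

def connected_components(equations):
    """Group unknowns that are linked through shared equations.

    `equations` is a list of iterables; each lists the unknowns that
    appear in one constraint (hard or soft).
    """
    neighbours = defaultdict(set)
    for unknowns in equations:
        unknowns = set(unknowns)
        for u in unknowns:
            neighbours[u] |= unknowns
    seen, components = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                      # iterative depth-first search
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            component.add(u)
            stack.extend(neighbours[u] - seen)
        components.append(component)
    return components

def underdetermined_components(equations):
    """Return the components that have fewer equations than unknowns."""
    flagged = []
    for component in connected_components(equations):
        n_equations = sum(1 for e in equations if component & set(e))
        if n_equations < len(component):
            flagged.append(component)
    return flagged
```

In the floating-monitor example, the monitor's unknowns form their own component, and that component is the one reported if it lacks enough constraints.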



3 3D Modeling Using Layered Stereo 

3.1 Overview of Layered Stereo Approach 

Our second 3D modeling system interactively extracts structure as a collection 
of 3D (quasi-)planar layers from multiple images. The basic concepts of the 
layered stereo approach are illustrated in Figure 1. Assume that we are given as 
input $K$ images $I_1(\mathbf{u}_1), I_2(\mathbf{u}_2), \ldots, I_K(\mathbf{u}_K)$^4 captured by $K$ cameras with camera 
matrices $\mathbf{P}_1, \mathbf{P}_2, \ldots, \mathbf{P}_K$. In what follows, we will drop the image coordinates 
unless they are needed to explain a warping operation explicitly. Our hypothesis 
is that we can reconstruct the world as a collection of $L$ approximately planar 
layers. Following [?], we denote a layer “sprite” image by $L_l(\mathbf{u}_l) = (\alpha_l \cdot r_l, \alpha_l \cdot g_l, \alpha_l \cdot b_l, \alpha_l)$, 
where $r_l = r_l(\mathbf{u}_l)$ is the red band, $g_l = g_l(\mathbf{u}_l)$ is the green band, 
$b_l = b_l(\mathbf{u}_l)$ is the blue band, and $\alpha_l = \alpha_l(\mathbf{u}_l)$ is the opacity of pixel $\mathbf{u}_l$.^5 We 
also associate with each layer a homogeneous vector $\mathbf{n}_l$ which defines the plane 
equation of the layer via $\mathbf{n}_l^T \mathbf{x} = 0$, and optionally a per-pixel residual depth 
offset $Z_l(\mathbf{u}_l)$. 

A number of automatic techniques have been developed to initialize the 
layers, e.g., merging [?], splitting [?], color segmentation [?], and plane 
fitting to a recovered depth map. In our system, we interactively initialize the 
layers because we wish to focus initially on techniques for creating composite 
(mosaic) sprites from multiple images, estimating the sprite plane equations, 
and refining layer assignments. We plan to incorporate automated initialization 
techniques later. 

^4 We use homogeneous coordinates in this section for both 3D world coordinates 
$\mathbf{x} = (x, y, z, 1)^T$ and for 2D image coordinates $\mathbf{u} = (u, v, 1)^T$. 

^5 The terminology comes from computer graphics, where sprites are used to quickly 
(re-)render scenes composed of many objects [?]. 

Fig. 1. Suppose $K$ images $I_k$ are captured by $K$ cameras $\mathbf{P}_k$. We assume the scene 
can be represented by $L$ sprite images $L_l$ on planes $\mathbf{n}_l^T \mathbf{x} = 0$ with depth offsets $Z_l$. The 
boolean masks $B_{kl}$ denote the pixels in image $I_k$ from layer $L_l$ and the masked images 
$M_{kl} = B_{kl} \cdot I_k$. 

The input consists of a collection of images $I_k$ taken with known camera 
matrices $\mathbf{P}_k$. The camera matrices can be estimated when they are not known 
a priori, either using traditional structure from motion [?], or directly 
from the homographies relating sprites in different images [22]. Our goal is 
to estimate the layer sprites $L_l$, the plane vectors $\mathbf{n}_l$, and the residual depths $Z_l$. 

Our approach can be subdivided into a number of steps where we estimate 
each of $L_l$, $\mathbf{n}_l$, and $Z_l$ in turn. To compute these quantities, we use auxiliary 
boolean mask images $B_{kl}$. The boolean masks $B_{kl}$ denote the pixels in image $I_k$ 
which are images of points in layer $L_l$. Since we are assuming boolean opacities, 
$B_{kl} = 1$ if and only if $L_l$ is the front-most layer which is opaque at that pixel 
in image $I_k$. Hence, in addition to $L_l$, $\mathbf{n}_l$, and $Z_l$, we also need to estimate 
the boolean masks $B_{kl}$. Once we have estimated these masks, we can compute 
masked input images $M_{kl} = B_{kl} \cdot I_k$ (see Figure 1). 

Given any three of $L_l$, $\mathbf{n}_l$, $Z_l$, and $B_{kl}$, there are techniques for estimating 
the remaining one. Our algorithm therefore consists of first initializing these 
quantities. Then, we iteratively estimate each of these quantities in turn, fixing 
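the other three. 

3.2 Estimation of Plane Equations 

In order to compute the plane equation vector $\mathbf{n}_l$, we need to be able to map 
points in masked image $M_{kl}$ onto the plane $\mathbf{n}_l^T \mathbf{x} = 0$. If $\mathbf{x}$ is a 3D world coordinate 
of a point and $\mathbf{u}_k$ is the image of $\mathbf{x}$ in camera $\mathbf{P}_k$, we have: 

$$\mathbf{u}_k = \mathbf{P}_k \mathbf{x}$$

where equality is in the 2D projective space $\mathcal{P}^2$. Since $\mathbf{P}_k$ is of rank 3, we can 
write: 

$$\mathbf{x} = \mathbf{P}_k^* \mathbf{u}_k + s \mathbf{p}_k \qquad (4)$$

where $\mathbf{P}_k^* = \mathbf{P}_k^T (\mathbf{P}_k \mathbf{P}_k^T)^{-1}$ is the pseudoinverse of $\mathbf{P}_k$, $s$ is an unknown scalar, 
and $\mathbf{p}_k$ is a vector in the null space of $\mathbf{P}_k$, i.e. $\mathbf{P}_k \mathbf{p}_k = 0$. If $\mathbf{x}$ lies on the plane 
$\mathbf{n}_l^T \mathbf{x} = 0$ we can solve for $s$, substitute into Equation (4), and obtain: 

$$\mathbf{x} = \left( (\mathbf{n}_l^T \mathbf{p}_k) \mathbf{I} - \mathbf{p}_k \mathbf{n}_l^T \right) \mathbf{P}_k^* \mathbf{u}_k. \qquad (5)$$

Equation (5) allows us to map a point in image $M_{kl}$ onto the point on plane 
$\mathbf{n}_l^T \mathbf{x} = 0$, of which it is an image. Afterwards we can map this point onto its 
image in another camera $\mathbf{P}_{k'}$: 

$$\mathbf{u}_{k'} = \mathbf{P}_{k'} \left( (\mathbf{n}_l^T \mathbf{p}_k) \mathbf{I} - \mathbf{p}_k \mathbf{n}_l^T \right) \mathbf{P}_k^* \mathbf{u}_k = \mathbf{H}^l_{kk'} \mathbf{u}_k \qquad (6)$$

where $\mathbf{H}^l_{kk'}$ is a homography (collineation of $\mathcal{P}^2$). Equation (6) describes the 
image coordinate warp between the two images $M_{kl}$ and $M_{k'l}$ which would hold 
if all the masked image pixels were images of world points on the plane $\mathbf{n}_l^T \mathbf{x} = 0$. 
Using this relation, we can warp all of the masked images onto the coordinate 
frame of one distinguished image, w.l.o.g. image $M_{1l}$, as follows: 

$$(\mathbf{H}^l_{1k} \circ M_{kl})(\mathbf{u}_1) = M_{kl}(\mathbf{H}^l_{1k} \mathbf{u}_1).$$

Here, $\mathbf{H}^l_{1k} \circ M_{kl}$ is the masked image $M_{kl}$ warped into the coordinate frame of 
$M_{1l}$. 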
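
The plane-induced homography of Equation (6) is straightforward to evaluate numerically. The following sketch assumes 3×4 camera matrices of rank 3 and a homogeneous plane vector; the function names are hypothetical, and the result is only defined up to scale, as is usual for homographies.

```python
import numpy as np

def null_vector(P):
    """Unit vector p with P @ p = 0 for a rank-3, 3x4 camera matrix."""
    _, _, Vt = np.linalg.svd(P)
    return Vt[-1]

def plane_homography(P_k, P_k2, n):
    """Homography mapping pixels of camera P_k to camera P_k2 via the plane n^T x = 0.

    Implements P_k2 ((n^T p_k) I - p_k n^T) P_k^*, with P_k^* the pseudoinverse.
    """
    P_k = np.asarray(P_k, dtype=float)
    P_k2 = np.asarray(P_k2, dtype=float)
    n = np.asarray(n, dtype=float)
    p_k = null_vector(P_k)
    P_pinv = np.linalg.pinv(P_k)          # equals P^T (P P^T)^-1 for full row rank
    to_plane = (n @ p_k) * np.eye(4) - np.outer(p_k, n)
    return P_k2 @ to_plane @ P_pinv
```

A quick sanity check is to project a 3D point that lies on the plane into both cameras and confirm that the homography maps one projection onto the other up to scale.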

We can therefore solve for $\mathbf{n}_l$ by finding the value for which the homographies 
$\mathbf{H}^l_{1k}$ defined in Equation (6) best register the images onto each other. Typically, 
this value is found using some form of gradient descent, such as the Gauss-Newton 
method, and the optimization is performed in a hierarchical (i.e. pyramid based) 
fashion to avoid local extrema [?]. To apply this approach, we compute the 
Jacobian of the image warp $\mathbf{H}^l_{1k}$ with respect to the parameters of $\mathbf{n}_l$. Alternatively, 
we can first compute a set of unconstrained homographies using a standard 
mosaic construction algorithm, and then invoke a structure from motion algorithm 
to recover the plane equation (and also the camera matrices, if desired) [22, ?]. 

3.3 Estimation of Layer Sprites 

Before we can compute the layer sprite images $L_l$, we need to choose 2D 
coordinate systems for the planes. Such coordinate systems can be specified by a 
collection of arbitrary (rank 3) camera matrices $\mathbf{Q}_l$.^6 Then, following the same 
argument used in Equations (5) and (6), we can show that the image coordinates 
$\mathbf{u}_k$ of the point in image $M_{kl}$ which is projected onto the point $\mathbf{u}_l$ on the plane 
$\mathbf{n}_l^T \mathbf{x} = 0$ is given by: 

$$\mathbf{u}_k = \mathbf{P}_k \left( (\mathbf{n}_l^T \mathbf{q}_l) \mathbf{I} - \mathbf{q}_l \mathbf{n}_l^T \right) \mathbf{Q}_l^* \mathbf{u}_l = \mathbf{H}^l_k \mathbf{u}_l \qquad (7)$$

where $\mathbf{Q}_l^*$ is the pseudo-inverse of $\mathbf{Q}_l$ and $\mathbf{q}_l$ is a vector in the null space of 
$\mathbf{Q}_l$. The homography $\mathbf{H}^l_k$ can be used to warp the image $M_{kl}$ forward onto the 
plane, the result of which is denoted $\mathbf{H}^l_k \circ M_{kl}$. After we have warped the masked 
image onto the plane, we can estimate the layer sprite (with boolean opacities) 
by “blending” the warped images: 

$$L_l = \bigoplus_{k=1}^{K} \mathbf{H}^l_k \circ M_{kl} \qquad (8)$$

where $\oplus$ is a blending operator. 

There are a number of ways the blending can be performed. One simple 
method is to take the mean of the color or intensity values. A refinement is to 
use a “feathering” algorithm, where the averaging is weighted by the distance 
of each pixel from the nearest invisible pixel in $M_{kl}$. Alternatively, robust 
techniques can be used to estimate $L_l$ from the warped images. 
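
A small sketch of the feathering idea, under simplifying assumptions: the images are single-band, they have already been warped into the sprite frame, and the distance to the nearest invisible pixel is approximated by a city-block distance computed with repeated erosion (pixels beyond the image border count as invisible). Both function names are hypothetical.

```python
import numpy as np

def distance_to_invisible(mask):
    """City-block distance from each visible pixel to the nearest invisible one.

    Computed as the number of 4-neighbour erosions a pixel survives;
    pixels outside the image are treated as invisible.
    """
    current = np.asarray(mask, dtype=bool).copy()
    dist = np.zeros(current.shape)
    while current.any():
        dist += current
        p = np.pad(current, 1, mode="constant", constant_values=False)
        current = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                   & p[1:-1, :-2] & p[1:-1, 2:])
    return dist

def feather_blend(warped_images, masks):
    """Weighted average of warped images; weights grow away from mask edges.

    Returns the blended sprite and a boolean opacity map.
    """
    num = np.zeros(np.shape(warped_images[0]), dtype=float)
    den = np.zeros_like(num)
    for image, mask in zip(warped_images, masks):
        w = distance_to_invisible(mask)
        num += w * np.asarray(image, dtype=float)
        den += w
    sprite = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return sprite, den > 0
```

Setting all weights to one inside the masks would reduce this to the plain mean mentioned first; the feathered weights instead let each image dominate where it is far from its own boundary.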

3.4 Estimation of Residual Depth 

In general, the scene will not be exactly piecewise planar. To model any 
non-planarity, we assume that the point $\mathbf{u}_l$ on the plane $\mathbf{n}_l^T \mathbf{x} = 0$ is displaced slightly 
in the direction of the ray through $\mathbf{u}_l$ defined by the camera matrix $\mathbf{Q}_l$, and that 
the distance it is displaced is $Z_l(\mathbf{u}_l)$, measured in the direction normal to the 
plane. In this case, the homographic warps used in the previous section are not 
applicable. However, using a similar argument to that in Sections 3.2 and 3.3 it 
is easy to show (see also [?]) that: 

$$\mathbf{u}_k = \mathbf{H}^l_k \mathbf{u}_l + Z_l(\mathbf{u}_l) \, \mathbf{t}_{kl} \qquad (9)$$

where $\mathbf{H}^l_k = \mathbf{P}_k \left( (\mathbf{n}_l^T \mathbf{q}_l) \mathbf{I} - \mathbf{q}_l \mathbf{n}_l^T \right) \mathbf{Q}_l^*$ is the planar homography of Section 3.3, 
$\mathbf{t}_{kl} = \mathbf{P}_k \mathbf{q}_l$ is the epipole, and it is assumed that the plane equation vector 
$\mathbf{n}_l = (n_x, n_y, n_z, n_d)^T$ has been normalized so that $n_x^2 + n_y^2 + n_z^2 = 1$. Equation (9) 
can be used to map plane coordinates $\mathbf{u}_l$ backwards to image coordinates $\mathbf{u}_k$, 
or to map the image $M_{kl}$ forwards onto the plane. We denote the result of this 
warp by $(\mathbf{H}^l_k, \mathbf{t}_{kl}, Z_l) \circ M_{kl}$, or $\mathbf{W}^l_k \circ M_{kl}$ for more concise notation. 

To compute the residual depth map $Z_l$, we could optimize the same (or a 
similar) consistency metric as that used in Section 3.2 to estimate the plane equation. 
Doing so is essentially solving a simpler (or what [?] would call “model-based”) 
stereo problem. In fact, almost any stereo algorithm could be used to compute 
$Z_l$. The algorithm should favor small disparities. 

^6 A reasonable choice for $\mathbf{Q}_l$ is one of the camera matrices $\mathbf{P}_k$. 

3.5 Pixel Assignment to Layers 

In the previous three sections, we have assumed a known assignment of pixels 
to layers, i.e., known boolean masks $B_{kl}$ which allow us to compute the masked 
image $M_{kl}$ using $M_{kl} = B_{kl} \cdot I_k$. We now describe how to estimate the pixel 
assignments from $\mathbf{n}_l$, $L_l$, and $Z_l$. 

We could try to update the pixel assignments by comparing the warped 
images $\mathbf{W}^l_k \circ M_{kl}$ to the layer sprite images $L_l$. However, if we compared these 
images, we would not be able to deduce anything about the pixel assignments 
outside of the current estimates of the masked regions. To allow the boolean 
mask $B_{kl}$ to “grow”, we therefore compare $\mathbf{W}^l_k \circ I_k$ with: 

$$\tilde{L}_l = \bigoplus_{k=1}^{K} \mathbf{W}^l_k \circ \tilde{M}_{kl},$$

where $\tilde{M}_{kl} = \tilde{B}_{kl} \cdot I_k$ and $\tilde{B}_{kl}$ is a dilated version of $B_{kl}$ (if necessary, $Z_l$ is also 
enlarged so that it declines to zero outside the masked region). 

Given the enlarged layer sprites $\tilde{L}_l$, our approach to pixel assignment is as 
follows. We first compute a measure $P_{kl}(\mathbf{u}_l)$ of the likelihood that the pixel 
$\mathbf{W}^l_k \circ I_k(\mathbf{u}_l)$ is the warped image of the pixel $\mathbf{u}_l$ in the enlarged layer sprite $\tilde{L}_l$ 
(the measure is a residual, so lower values indicate a better match). 
Next, $P_{kl}$ is warped back into the coordinate system of the input image $I_k$ to 
yield: 

$$\tilde{P}_{kl} = (\mathbf{W}^l_k)^{-1} \circ P_{kl}.$$

This warping tends to blur $P_{kl}$, but this is acceptable since we will want to 
smooth the pixel assignment. The pixel assignment can then be computed by 
choosing the best possible layer for each pixel: 

$$B_{kl}(\mathbf{u}_k) = \begin{cases} 1 & \text{if } \tilde{P}_{kl}(\mathbf{u}_k) = \min_{l'} \tilde{P}_{kl'}(\mathbf{u}_k) \\ 0 & \text{otherwise.} \end{cases}$$

The simplest way of defining $P_{kl}$ is the residual intensity difference [?]; 
another possibility is the residual normal flow magnitude [?]. A third possibility 
would be to compute the optical flow between $\mathbf{W}^l_k \circ I_k$ and $\tilde{L}_l$ and then use the 
magnitude of the flow for $P_{kl}$. 
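
The per-pixel selection rule can be sketched as a simple arg-min over layers. The sketch assumes the warped-back residuals for one input image are stacked in an array of shape (layers, height, width); the function name is hypothetical, and ties are broken in favour of the lowest layer index so that the resulting masks are disjoint.

```python
import numpy as np

def assign_pixels(residuals):
    """Boolean masks B[l] for one image: True where layer l has the smallest residual.

    residuals: array of shape (L, H, W) holding the warped-back residuals
    for every layer l. Ties go to the lowest layer index.
    """
    P = np.asarray(residuals, dtype=float)
    best_layer = np.argmin(P, axis=0)                  # (H, W)
    layer_ids = np.arange(P.shape[0])[:, None, None]   # (L, 1, 1)
    return best_layer[None, :, :] == layer_ids
```

Smoothing the residual maps before the arg-min, as the blur introduced by the inverse warp already does to some extent, would give spatially more coherent masks.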



3.6 Layer Refinement by Re-synthesis 

The layered stereo algorithm described above is limited to recovering binary 
masks $B_{kl}$ for the assignment of input pixels to layers. If we wanted to, we could 
use an EM (expectation maximization) algorithm to obtain graded (continuous) 
assignments [?]. However, EM models mixtures of probability distributions, 
rather than the kind of partial occlusion mixing that occurs at sprite boundaries 
[?]. Stereo techniques inspired by matte extraction [?] are needed to refine the 
color/opacity estimates for each layer [35]. Such an algorithm would work by 
re-synthesizing each input image from the current sprite estimates, and then 
adjusting pixel colors and opacities so as to minimize the difference between 
the original and re-synthesized images. We are planning to implement such an 
algorithm in future work. 

4 Experiments 

We have implemented our panorama 3D modeling system on a PC and tested 
it with single and multiple panoramas. The system consists of two parts: the 
interface (viewing the panorama with pan, tilt, and zoom control) and the 
modeler (recovering the camera pose and the 3D model). Figure 2 shows a spherical 
panoramic image on the left and a simple reconstructed 3D model on the right. 
The coordinate system on the left corner (red) is the world coordinate system, and the 
coordinate system in the middle (green) is the camera coordinate system. The panorama 
is composed of 60 images using the method of creating full-view panoramas [33]. 
The extracted texture maps (without top and bottom faces) are shown in Figure 3. 
Notice how the texture maps in Figure 3 have different sampling rates from 
the original images. The sampling is the best (e.g., Figure 3(b)) when the surface 
normal is parallel with the viewing direction from the camera center, and the 
worst (e.g., Figure 3(d)) when perpendicular. This explains why the sampling on 
the left is better than that on the right in Figure 3(a). Figure 4 shows two views 
of our interactive modeling system. Green lines and points are the 2D items that 
are manually drawn and assigned with properties, and blue lines and points are 
projections of the recovered 3D model. It took about 15 minutes for the authors 
to build the simple model in Figure 2. In 30 minutes, we can construct the more 
complicated model shown in Figure 5. 

Figures 6 and 7 show an example of building 3D models from multiple 
panoramas. Figure 6 shows two spherical panoramas built from image sequences taken 
with a hand-held digital video camera. Figure 7 shows two views of the reconstructed 
3D wireframe model from the two panoramas in Figure 6. Notice that the 
occluded middle area in the first panorama (behind the tree) is recovered because 
it is visible in the second panorama. 

We have applied our layered stereo modeling system to a number of 
multi-frame stereo data sets. A standard point tracking and structure from motion 
algorithm is used to recover a camera matrix for each image. To initialize our 
algorithm, we interactively specify how many layers to use and then perform a rough 
assignment of pixels to layers. Next, an automatic hierarchical parametric motion 
estimation algorithm similar to [?] is used to find the homographies between the 
layers, as defined in Equation (?). For the experiments presented in this paper, 
we set $\mathbf{Q}_l = \mathbf{P}_1$, i.e. we reconstruct the sprites in the coordinate system of the 
first camera. Using these homographies, we find the best plane estimate for each 
layer using a Euclidean structure from motion algorithm [?]. 

The results of applying these steps to the MPEG flower garden sequence are 
shown in Figure 8. Figures 8(a) and (b) show the first and last image in the 
subsequence we used (the first seven even images). Figure 8(c) shows the 
initial pixel labeling into seven layers. Figures 8(d) and (e) show the sprite images 
corresponding to each of the seven layers, re-arranged for more compact 
display. Note that because of the compositing and blending that takes place during 
sprite construction, each sprite is larger than its footprint in any one of the input 
images. This sprite representation makes it very easy to re-synthesize novel 
images without leaving gaps in the new image, unlike approaches based on a single 
painted depth map [?]. Figure 8(f) shows the depth map computed by 
painting every pixel in every sprite with its corresponding color coded Z value, and 
then re-compositing the image. Notice how the depth discontinuities are much 
crisper and cleaner than those available with traditional stereo correspondence 
algorithms. 

Fig. 3. Texture maps for the 3D model. 

Fig. 4. Two views of the interactive system. 

Fig. 5. A more complex 3D model from a single panorama. 

Fig. 6. Two input panoramas of an indoor scene. 

Fig. 7. Two views of a 3D model from multiple panoramas. 


Our second set of experiments uses five images taken from a 40-image stereo 
dataset taken at a computer graphics symposium. Figure 9(a) shows the 
middle input image, Figure 9(b) shows the initial pixel assignment to layers, and 
Figure 9(c) shows the recovered depth map. Figures 9(d) and (e) show the 
recovered sprites, and Figure 9(f) shows the middle image re-synthesized from these 
sprites. The gaps visible in Figures 9(c) and 9(f) lie outside the area 
corresponding to the middle image, where the appropriate parts of the background sprites 
could not be seen. 

5 Discussion and Conclusions 

In this paper, we have presented two systems for interactively constructing 
complex (large-scale) 3D models from multiple images. Our modeling systems 
are able to construct accurate geometrical and photo-realistic 3D models because 
our approaches have much less ambiguity than traditional structure from motion 
or stereo approaches. Our results show that it is desirable and practical for the 
modeling systems to take advantage of as many regularities and as much prior knowledge 
about man-made environments (such as vertices, lines, and planes) as possible. 

Our panorama 3D modeling system decomposes the modeling process into 
a zero baseline problem (panorama construction) and a wide baseline problem 
(stereo or structure from motion). Using the knowledge of the scene, e.g., known 
line directions, parallel or perpendicular planes, our system first recovers the 
camera pose for each panorama, and then constructs the 3D model using all pos- 
sible constraints. In particular, we carefully partition the recovery problem into 
a series of linear estimation stages, and divide the constraints into “hard” and 
“soft” constraints so that each estimation stage becomes a linearly-constrained 
least-squares problem. 

Our layered stereo modeling system makes use of a different kind of scene 
regularity, where input images are segmented into planar layers (with possible 
small depth offsets). By drastically reducing the number of unknowns (only 3 
parameters for each sprite plane), the recovery of 3D structure is much more 
robust than with conventional stereo algorithms. 

Fig. 8. Results on the flower garden sequence: (a) first and (b) last input images; 
(c) initial segmentation into six layers; (d) and (e) the six layer sprites; (f) depth 
map for planar sprites (bottom strip illustrates the coding of depths as colors). 

Fig. 9. Results on the symposium sequence: (a) third of five images; (b) initial 
segmentation into six layers; (c) recovered depth map; (d) and (e) the five layer 
sprites; (f) re-synthesized third image (note extended field of view). 

We are working on several extensions to improve the usability and 
generality of our system. Naturally, we want to automate even more parts of the 
interactive systems. For the panorama modeling system, we have implemented 
an automatic line snapping technique which snaps lines to their closest edges 
present in the panorama. We also plan to incorporate automatic line detection, 
corner detection, inter-image correspondence, and other feature 
detection to further automate the system. If we use more features with automatic 
feature extraction and correspondence techniques, robust modeling techniques 
should also be developed [?]. For the layered stereo modeling system, we plan to 
automate the interactive masking process by specifying layers in only a few (e.g., 
the first and the last) images and by incorporating motion segmentation and 
color segmentation techniques. 

We are also planning to combine our layered stereo modeling system with the 
panorama modeling system. The idea is to build a rough model using the panorama 
modeling system and refine it using layered stereo wherever it is appropriate 
(similar in spirit to the model-based stereo of [?]). We are also investigating 
representations beyond texture-mapped 3D models, i.e., image-based rendering 
approaches [?] such as view-dependent texture maps and layered depth 
images [?]. Integrating all of these into one interactive modeling system will 
enable users to easily construct complex photorealistic 3D models from images. 

Abstract. This paper presents a method for modeling the surfaces of a 3D 
scene from a set of registered range maps. The integration of range maps into a 
unique accurate representation is made tricky mainly because of the presence of 
noise in the viewpoint positions and in the range estimates. In the present case, 
the scene is captured by a CCD camera system and the depth maps are 
estimated by a stereovision technique. This approach makes the problem of 
integration particularly thorny. In fact, the range maps are generally redundant 
but corrupted by noise and not always coherent with each other. The integration 
method presented in this paper is based on a fundamental principle : whatever 
the scene is, the range maps must be consistent with each other. This principle 
is used as a constraint to discard noise and increase the 3D data accuracy and to 
identify and remove the redundancies leading to a minimal accurate 
representation. This phase is realized through the detection of inconsistencies 
between the range maps of the different viewpoints, the identification and the 
removal of the most inconsistent points, and the fusion of the remaining 
redundant points. The process is repeated until the depth maps are coherent with 
each other. Finally, the facet model is built by incrementally integrating the 
coherent depth maps. This system is independent of the depth estimation part 
and can process any set of depth maps of any scene. 



1 Introduction 

The construction of 3D models can be viewed as the problem of acquisition of 3D 
points, registration in a single reference system and integration of these points into a 
unique facet model. The resulting model must approximate the surfaces as precisely 
as possible. The data can be composed of unorganized sampled 3D points but are 
often captured through 2.5D range maps. This 2D image structure contains 
information about surface topology, and therefore usefully constrains the problem. 
In practice, the range maps are noisy and generally overlap. Therefore, the 
construction of the model consists of a fusion process of the set of the maps as well as 
triangulation of the observed surfaces. Two approaches have been proposed for this 
process: 

- mesh integration 

- volumetric fusion 

In the mesh integration approach, the range images are first independently 
triangulated. The triangulation is generally constrained by discontinuities in 
depth/orientation. 

Rutishauser et al. [1] and Turk and Levoy [2] create a polygon mesh for each view; 
the individual meshes are then connected to form a single mesh covering the whole 
object. In both cases, the merging procedure is a rearrangement of the facets while 
preserving the 3D points. However, this re-arrangement may lead to implausible 
surfaces. 

Soucy and Laurendeau [3] divide the range maps into subsets of overlapping 
surface regions from the different views. In each subset, the redundancy of the views 
is used to improve the surface approximation. A virtual viewpoint is defined for each 
subset, the depth maps are projected on this virtual map, and the depth values are 
fused. However, the number of possible virtual viewpoints can be very high ($2^n$ for $n$ 
maps). Moreover, too little attention is paid to the projection of the depth maps, since 
points of different surfaces can be combined. 

Pito [4] introduces a notion of consistency between triangles from different 
viewpoints. The test of similarity of two triangles is based on their relative distance 
and orientation. Then, the overlapping triangles are processed such that just one is 
kept : if two triangles are similar the one with the highest confidence is kept, if they 
are not, the occluding triangle is removed. Then several phases aim to integrate the 
meshes and to build a robust model. 

In volumetric fusion of registered range maps, Pulli et al. [5] use a hierarchical 
octree representation and estimate a surface using a hierarchical space carving method 
that labels the cubes as outside the object, inside or on the boundary. This method 
does not tackle the problem of noise in the range maps. 

Hilton et al. [6] propose an integration algorithm based on a continuous implicit 
surface representation that does not compensate for noise. 

Wheeler et al.’s method [7] merges the set of range maps into a volumetric 
implicit-surface representation. The method called the consensus-surface algorithm 
deals with noise by requiring a quorum of observations before using them to build the 
model : the signed distance to the object surface is estimated by finding a consensus 
of locally coherent observations (in terms of 3D location and orientation) of the 
surface. A consensus surface is then derived by merging the selected observations and 
a surface mesh is obtained using a variant of the marching-cubes algorithm. As stated in 
the paper, the algorithm assumes that it is possible to separate the noise from the 
relevant data by means of a threshold on the consensus value. 

2 Presentation of the Context 

Most of the techniques described above are applied to data issued from range 
scanners. In 3D reconstruction from passive vision, the presence of noise is 
particularly critical and a robust technique is required to integrate the range maps in a 
unique representation. Researchers have developed methods to build surface models 
in special cases (e.g. deformable models [8], elevation models [9]). Outlier 
detection and removal have been tackled by locally modeling the surfaces around the 
3D points [10]. 

In the present case, a set of depth maps is estimated by means of a vision-based 
system [11]. This system exploits the luminance images provided by a single camera 
moving in front of an arbitrary static scene. The resulting depth maps are dense and 
redundant with each other. They are also noisy and not everywhere coherent with 
each other. In particular, due to the correspondence problem, erroneous areas may 
occur. Such noise is intra-frame and even inter-frame correlated, sometimes leading to 
a high consensus of 3D points for wrong surfaces. 

A volumetric approach is excluded: the scene can be complex, and the 
density of the observed 3D data is generally irregular (the resolution depends on the 
shape of the scene as well as on the acquisition conditions). This makes noise removal 
through a simple density-based criterion difficult. Moreover, it makes it difficult to 
preserve the image resolution of the scene. 

Therefore we use the image grid as the support of the data and of the processing. 
The depth maps are supposed to be registered either by pre-calibration, on-line 
calibration or by robust registration such as in [12]. The integration method must be 
able to solve the inconsistencies between the range maps. 

Two modeling approaches have been developed, with particular attention paid to
resolving the inconsistencies between the depth maps. The first one incrementally
integrates the depth maps into the model and manages, at each time instant, the
differences between the model and the current depth map [13]. In this method, the
current model is projected on the new viewpoint and compared to the corresponding
depth map. The new areas are approximated and triangulated, while the old ones are
processed according to the similarity between the current depth map and the 3D
model: if this similarity is high, they are simply adjusted; if there is a conflict,
they are either maintained or corrected and retriangulated.

A conflict arises in the presence of erroneous data. The problem is to identify
which value is erroneous at a given pixel: the 3D model or the current depth map.
There is no robust local test to choose between the two candidates. In [13], a
luminance-based criterion was defined to choose between the two values, but tests
showed that this criterion is not reliable and does not solve the problem
satisfactorily.

In the second approach [14], the range maps are all combined in order to detect
and remove the outliers and to produce coherent maps; then, a 3D facet model is built
by integrating these maps. This more robust approach has recently been improved and
is presented in the following sections: inconsistency is defined in section 3, the
consistency processing is detailed in section 4, and section 5 shows a set of results
and gives the conclusion.

3 Definition of Inconsistency

The consistency processing is carried out on the different viewpoints of the sequence
and relies on the properties of the 2½D representation provided by the depth
maps. The objective is to make the depth maps consistent with each other and
accurate before their integration into a single 3D model. First, the inconsistencies
must be detected and resolved. What is an inconsistency? Let us consider a given
viewpoint to which a depth map is assigned. This depth map describes 3D
points that are surface elements. The line joining the camera center to any 3D point of
the depth map should not cross any other surface, since this point is seen from this
viewpoint. Therefore, there is an inconsistency when a surface seen from another
viewpoint, once projected on the current viewpoint, hides the surface seen from it. This
is the basic principle of inconsistency detection.

Figure 1a displays two examples in which two depth maps are consistent with each
other:

• in the first case, the two viewpoints see two different areas, and surface B is not
incompatible with what viewpoint 1 sees (surface A).

• in the second case, viewpoints 1 and 2 see the same surface, which is called A1 from
viewpoint 1 and A2 from viewpoint 2. The surface measurements from viewpoints 1
and 2 are similar and therefore coherent with each other.

Fig. 1a. Two examples of inter-image consistency

Figure 1b displays an example in which the two viewpoints see inconsistent data.
Viewpoint 1 sees surface A1 while viewpoint 2 sees surfaces B2 and C2. The problem
appears when what is seen from viewpoint 2 is observed from viewpoint 1: in this case,
surface C2 hides surface A1 from viewpoint 1, so there is an inconsistency. At
least one of the two surfaces is false.

Note that, conversely, if what is seen from viewpoint 1 is observed
from viewpoint 2, there is no inconsistency: surface A1 is simply hidden by surface
C2. Consequently, the detection of inconsistencies between two viewpoints requires
observing the two sets of data from both views.

Fig. 1b. Example of inter-image inconsistency (surfaces A1, B2 and C2; viewpoints 1 and 2)



4 Consistency Processing 



4.1 Overview of the Algorithm 

The integration method is based on the principle that, whatever the scene, the range
maps must be consistent with each other. This strong constraint is introduced to build
a single representation of the scene from the noisy input data. The objective is to
estimate a new set of depth maps that are consistent with each other, as close to the
input set as possible, and locally smooth. The estimation problem can be formulated in
terms of the minimization of a set of three energy functions:

• $E_c$, which measures the inconsistency between the depth maps

• $E_f$, which measures the distance of the resulting depth maps to the observation
maps

• $E_s$, which introduces a weak smoothness constraint

Therefore, the objective is to estimate a set of filtered depth maps $z_f$ that minimize
the global energy $E$:

$$\min_{z_f} E = \min_{z_f} \left( E_c + E_f + E_s \right)$$

Let us consider a set of $N$ depth maps: each depth map is described by $z(p,i)$, in
which $z$ is the depth component and $p$ is a given pixel of image $i$:

• the distance measure $E_f$ is defined by the sum, over all the pixels $p$ of all images $i$, of
the distances between the observed depth values and the resulting ones:

$$E_f = \sum_{i=1}^{N} \sum_{p=1}^{P} c(p,i)\, \| z_f(p,i) - z_{obs}(p,i) \|^2$$

where $c(p,i)$ is the confidence value assigned to $z_{obs}(p,i)$.

• the consistency measure $E_c$ is defined from the comparison of each depth map with
the others:

$$E_c = \sum_{i=1}^{N} \sum_{j \neq i} \sum_{p=1}^{P} \rho(p,i)\, \| z_f(p,i) - z_f^{proj}(p,i,j) \|^2$$

where $z_f^{proj}(p,i,j)$ is obtained by projecting depth map $j$ on $i$,
and $\rho(p,i) = 1$ if $z_f(p,i)$ and $z_f^{proj}(p,i,j)$ correspond to the same surface,
$\rho(p,i) = 0$ otherwise.

• the smoothness measure $E_s$ is defined by:

$$E_s = \sum_{i=1}^{N} \sum_{p=1}^{P} \| z_f(p,i) - \Phi(z_f(p,i)) \|^2$$

where $\Phi()$ is a local 2D operator on depth map $i$. This intra-image
constraint is mainly used to remove local artefacts.
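As a rough illustration of how the three terms combine, the following Python sketch evaluates a data term, a consistency term and a smoothness term for a single depth map stored as a flat list. The function name, the argument layout, the use of a three-tap average as the local operator and the toy values are all assumptions made for the example; the paper does not specify the operator, and this is not the authors' implementation.

```python
def energies(z_f, z_obs, conf, z_proj, same_surface):
    """Return (data, consistency, smoothness) energies for one depth map.

    z_f, z_obs, conf: flat lists with one value per pixel (hypothetical layout).
    z_proj[k][p]: depth of another map k after projection on this view.
    same_surface[k][p]: 0/1 indicator that both depths describe the same surface.
    """
    n = len(z_f)
    # Data term: confidence-weighted squared distance to the observations.
    e_data = sum(conf[p] * (z_f[p] - z_obs[p]) ** 2 for p in range(n))
    # Consistency term: squared distance to projected maps, same-surface pixels only.
    e_cons = sum(same_surface[k][p] * (z_f[p] - z_proj[k][p]) ** 2
                 for k in range(len(z_proj)) for p in range(n))

    # Assumed local operator: mean of the pixel and its neighbours (clamped at borders).
    def phi(p):
        window = z_f[max(p - 1, 0):min(p + 2, n)]
        return sum(window) / len(window)

    e_smooth = sum((z_f[p] - phi(p)) ** 2 for p in range(n))
    return e_data, e_cons, e_smooth
```

In this toy form, a pixel whose indicator is 0 contributes nothing to the consistency term, which mirrors the role of the same-surface indicator in the definition above.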

The consistency processing is realized in two steps (figure 2) : 

• inconsistency evaluation and outliers removal 

• multiview fusion of the consistent data. 




Fig. 2. Consistency processing

4.2 Inconsistency Evaluation and Outlier Identification

In this section, we explain how to detect and evaluate the
inconsistencies in the set of depth maps and how to identify the outliers.

Inconsistency detection and evaluation. The inconsistencies between two
viewpoints i and j are detected by comparing the depth map RMi of viewpoint i with
the projected depth map PMji, obtained by projecting the depth map RMj on
viewpoint i. Each pixel of RMi is compared to the corresponding one in PMji, and
the depth difference ΔZ = RMi - PMji is compared to the threshold λ:

• if -λ < ΔZ < λ, the two points are assumed to belong to the same surface: there is
consensus (consistency case)

• if ΔZ < -λ, the projected point is hidden by the current point (consistency case)

• if λ < ΔZ, the projected point hides the current point: there is an inconsistency.

λ defines the separation threshold on the depth difference ΔZ between two surfaces.
In case of inconsistency, the two points on the input depth maps RMi and RMj are
potentially erroneous (at least one of them is wrong) and both are consequently
penalized. In case of consensus, both points are rewarded. For each pixel, these
confidence values are accumulated through the comparison to all the other maps and
make up the inconsistency maps. The procedure InconsistencyDetection describes the
comparison between two views:

Procedure InconsistencyDetection
  Input: range maps RMi and RMj of viewpoints i and j
  Input: transformation matrix T(j,i) between viewpoints i and j
  Input: outlier maps (l_i, l_j)
  Output: inconsistency maps (c_i, c_j)
  FOR p_j ∈ 1...P
    IF ( l_j(p_j) = 1 )   /* p_j is not an outlier */
      (u_j, v_j, z_j) <- T(j,i) * (p_j, RMj(p_j))   /* (u_j, v_j): 2D image components */
      p_i = closestpixel(u_j, v_j)
      PMji(p_i) = z_j
      IF ( l_i(p_i) = 1 )   /* p_i is not an outlier */
        IF ( |RMi(p_i) - PMji(p_i)| < λ )   /* consensus */
          c_i(p_i) = c_i(p_i) + conf_consensus
          c_j(p_j) = c_j(p_j) + conf_consensus
        ELSE IF ( RMi(p_i) - PMji(p_i) > λ )   /* inconsistency */
          c_i(p_i) = c_i(p_i) + conf_inconsistency
          c_j(p_j) = c_j(p_j) + conf_inconsistency
End

The outlier maps distinguish the already identified outliers, which have been discarded
(l_i = 0), from the other points, which are still processed (l_i = 1) (see below).
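The three-way comparison of the procedure above can be sketched in Python as follows. The sketch assumes that the other depth map has already been projected onto the current view, that depth grows with distance from the camera, and that missing values are marked with None. The function name, the score increments and their sign convention (negative for a reward) are hypothetical choices for illustration only, as is the handling of a difference exactly equal to the threshold.

```python
def detect_inconsistencies(rm_i, pm_ji, threshold, reward=-1.0, penalty=1.0):
    """Label each pixel of view i against a map already projected onto view i.

    rm_i, pm_ji: lists of depths (None where no value is available).
    reward/penalty: hypothetical score increments; the sign convention
    (negative = rewarded) is an assumption of this sketch.
    """
    labels, increments = [], []
    for z_current, z_projected in zip(rm_i, pm_ji):
        if z_current is None or z_projected is None:
            labels.append("no data")
            increments.append(0.0)
            continue
        dz = z_current - z_projected
        if abs(dz) < threshold:
            labels.append("consensus")      # both depths describe the same surface
            increments.append(reward)
        elif dz >= threshold:
            labels.append("inconsistency")  # projected point lies in front of the current one
            increments.append(penalty)
        else:
            labels.append("hidden")         # projected point lies behind the current one
            increments.append(0.0)
    return labels, increments
```

Accumulating the increments over all pairs of views would give one score per pixel, which plays the role of the inconsistency map in this simplified setting.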

Outlier identification. Once all the comparisons have been done, the inconsistency
maps are processed to classify each pixel as consistent or inconsistent. A first possible
solution is to apply a threshold. However, as mentioned previously, in case of
inconsistency between two points, both are penalized. Consequently, thresholding the
inconsistency maps does not guarantee an ideal discrimination between the surface
points and the outliers. Instead, we perform the inconsistency detection and the outlier
removal iteratively, starting with a high threshold μ that is progressively decreased:
the suppression of the most heavily penalized points modifies the inconsistency maps, which
are re-evaluated (some inconsistent points can become consistent). As an example, the
threshold can be chosen at each iteration so that it suppresses approximately the 10%
of remaining points that are the most inconsistent. The process is applied until all
depth maps are consistent with each other.

MainProcedure InconsistencyEvaluation
  FOR i ∈ 1...N
    FOR p ∈ 1...P
      l_i(p) = 1   /* all points are initially valid */
  DO
    FOR i ∈ 1...N
      FOR j ∈ 1...N, j ≠ i
        InconsistencyDetection
    Compute_μ
    FOR i ∈ 1...N
      OutlierMap
  WHILE α ≠ 0
End

Procedure OutlierMap
  Input: inconsistency map c_i
  Output: outlier map l_i and number of outliers α
  α = 0
  FOR p ∈ 1...P
    IF ( c_i(p) > μ )
      l_i(p) = 0   /* p is classified as outlier */
      α = α + 1
End
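The effect of the iteration can be illustrated with a deliberately simplified Python sketch. Instead of projecting depth maps, it takes the pairwise conflicts as an explicit list of index pairs (an assumption made to keep the example self-contained), penalizes both members of each conflict, removes roughly a given fraction of the remaining points per pass starting with the most penalized, and re-evaluates the scores. The function and parameter names are hypothetical.

```python
import math


def remove_outliers(num_points, conflicts, fraction=0.1):
    """Iteratively discard the most penalized points until no conflict remains.

    conflicts: list of (a, b) index pairs that are mutually inconsistent
    (a stand-in for the projection-based detection). Returns the outlier set.
    """
    valid = [True] * num_points
    while True:
        # Re-evaluate the scores using only points that are still valid.
        score = [0] * num_points
        for a, b in conflicts:
            if valid[a] and valid[b]:
                score[a] += 1
                score[b] += 1
        remaining = [p for p in range(num_points) if valid[p]]
        inconsistent = [p for p in remaining if score[p] > 0]
        if not inconsistent:
            break
        # Remove about `fraction` of the remaining points, most penalized first.
        quota = max(1, math.floor(fraction * len(remaining)))
        inconsistent.sort(key=lambda p: score[p], reverse=True)
        for p in inconsistent[:quota]:
            valid[p] = False
    return {p for p in range(num_points) if not valid[p]}
```

With ten points and a single point in conflict with three others, only that point is removed: once it is discarded, the three points it had penalized become consistent again, which is the behaviour the iterative scheme relies on.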

4.3 Multiview Fusion 

The consistent pixels in the depth maps are updated through a multiview fusion
process. The points to be merged are identified by a projection step similar to the
previous one. The depth components are then merged on each viewpoint, taking into
account the confidence maps attached to the observations. The surface orientation is
also considered in the fusion process through the mesh of microfacets: a microfacet
links three neighboring pixels of the image grid. The orientation of the projected
microfacets is computed, and their attached points are selected only if the orientation
is compatible with the viewpoint on which the fusion is performed. A locally planar
surface model is then applied to the resulting depth maps in order to smooth the
surfaces and remove isolated noise. The resulting depth maps are more accurate than
the input depth maps and more consistent with each other.

MainProcedure MultiviewFusion
  FOR i ∈ 1...N
    FOR j ∈ 1...N, j ≠ i
      MicroFacetProjection
    FOR p_i ∈ 1...P
      Fusion
End

Procedure MicroFacetProjection
  Input: range map RMj, confidence map CMj and outlier map l_j of viewpoint j
  Input: transformation matrix T(j,i) between viewpoints i and j
  Output: projected range map PMji of range map j on viewpoint i
  Output: orientation map OMji and confidence map CMji attached to PMji
  FOR p_i ∈ 1...P
    PMji(p_i) = ∞
  FOR p_j ∈ 1...P
    (u_j, v_j, z_j) <- T(j,i) * (p_j, RMj(p_j))   /* (u_j, v_j): 2D image components */
    p_i = closestpixel(u_j, v_j)
    IF ( z_j < PMji(p_i) )
      PMji(p_i) = z_j
      CMji(p_i) = CMj(p_j)
      IF at least one of the projected microfacets attached to p_j has the right orientation
        OMji(p_i) = 1
      ELSE OMji(p_i) = 0
End

Procedure Fusion
  Input: RMi, CMi, PMji, CMji and OMji
  Output: fused range map FMi
  FOR p_i ∈ 1...P
    IF ( RMi(p_i) ≠ ∞ )
      conf = CMi(p_i)
      z_fus = RMi(p_i) * conf
      FOR j ∈ 1...N, j ≠ i
        IF ( |RMi(p_i) - PMji(p_i)| < λ AND OMji(p_i) = 1 )
          z_fus = z_fus + PMji(p_i) * CMji(p_i)
          conf = conf + CMji(p_i)
      FMi(p_i) = z_fus / conf
  Return FMi
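The per-pixel fusion rule amounts to a confidence-weighted mean over the projected depths that agree with the reference depth. A minimal Python sketch for one pixel is given below; the function name, the tuple layout of the candidates and the guard against a zero total confidence are assumptions added for the example.

```python
import math


def fuse_pixel(z_ref, conf_ref, candidates, threshold):
    """Confidence-weighted fusion of one pixel.

    candidates: list of (z_projected, confidence, orientation_ok) tuples, one per
    other view (hypothetical layout). A candidate contributes only if it is
    within `threshold` of the reference depth and its facet orientation is valid.
    """
    if math.isinf(z_ref):
        return z_ref  # no valid depth at this pixel
    z_sum = z_ref * conf_ref
    conf_sum = conf_ref
    for z_proj, conf_proj, orientation_ok in candidates:
        if abs(z_ref - z_proj) < threshold and orientation_ok:
            z_sum += z_proj * conf_proj
            conf_sum += conf_proj
    if conf_sum == 0:
        return z_ref  # assumed fallback when nothing carries any confidence
    return z_sum / conf_sum
```

For a reference depth of 2.0 with confidence 1.0 and one agreeing candidate at 2.5 with the same confidence, the fused depth is the midpoint 2.25; candidates that are too far away or wrongly oriented leave the result unchanged.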
4.4 Geometry and Texture Representation

The multiview consistency processing described above provides coherent depth maps.
Unlike volumetric fusion techniques, it preserves the image resolution. From
the resulting depth maps it is now possible to distinguish the various surfaces and to
remove the redundancies. The next step is to build a compact representation of the
scene. Two representations are possible and are presently under investigation for
comparison:

• a geometry-based representation: an incremental construction of a 3D facet
model, which is made relatively simple once the depth maps are consistent with each
other.

• an image-based representation (RGBZ), which is straightforward in our case.

The performance criteria are the compression ratio and the rendered image quality.



5 Results and Conclusion 

Synthetic scene. Figure 3a shows the top view of an original synthetic scene. A set of
284 depth maps of 512 rows * 10 lines has been created by moving a virtual camera in the
scene; figure 3a displays a cross-section of the 3D scene. In a second step, noise has
been added to the depth maps; figure 3b displays a cross-section of the 3D scene
recovered from the set of noisy depth maps.

The consistency processing has been applied to this set of images and the results of
the various steps are shown in figure 4: the 3D scene is shown once the outliers have
been discarded (figure 4a) and at the end of the fusion (figure 4b). For comparison,
figure 5 displays the results of a point rejection based on thresholding the 3D density
of the points: some wrong surfaces remain while some correct ones are removed, which
shows the unreliability of such a discrimination. The inconsistency measure combined with
adaptive thresholding is a better discrimination parameter.

Real scene. Figure 6 shows an example of images acquired by a moving camera. The
sequence Carton is composed of 18 images; the camera motion is mainly vertical.
Figures 6a, 6b and 6c correspond respectively to images n°1, 9 and 18. A joint
recursive estimation of depth maps and camera motion is performed along the sequence.
The depth maps corresponding to figure 6 are shown in figure 7. They are noisy and
not everywhere coherent. The results of the consistency processing are displayed in
figure 8. They clearly show the improvement brought by the processing. In
particular, the inconsistencies on the top-right of the initial depth maps have been
resolved. Figure 9 displays the interpolation of image n°10 from images n°1 and
n°18: after consistency processing of the depth maps in figure 9a, and from the
original depth maps in figure 9b.

In conclusion, we have built a structure-from-motion system for arbitrary scenes and
arbitrary small motions. A first step [11] provides a set of range maps with the attached
camera positions. The second step, presented in this paper, is a robust method that
provides coherent depth maps: the multiview consistency processing. In fact, this
processing is independent of the other parts of the system: it can be introduced into
any system with a set of input depth maps in order to make them consistent.
Moreover, the consistency property is a strong constraint that can be added to the
smoothness constraint in the interpolation of sparse depth maps. We also plan to take
luminance into account to further constrain the processing.

Abstract. In this paper we present a method for adapting a geometrical
deformable model (GDM) to a set of registered range images in
order to reconstruct real-world objects from multiple range images. Our 
approach registers the range images simultaneously, carves out an inter- 
mediate volume and finally generates an accurate, sparse triangle mesh. 
The proposed GDM scheme refines an initial roughly approximated mesh 
by deformation and adaptive subtriangulation. Even in the case of very 
large data sets our approach presents an efficient method of surface recon- 
struction due to adaptive improvement to the desired degree of accuracy. 
Since the root mean square approximation error of each triangle is min- 
imized in an iterative procedure, the mesh quality is higher than that of 
previous approaches. 



1 Introduction 

The reconstruction of complete object geometries with a 3D scanner device is 
generally not possible within one scan. Instead, the object has to be scanned 
from several directions in order to capture its complete geometry. The resulting 
range images must be registered and can subsequently be integrated into a model 
of the object surface. 

Due to some amount of noise in the original data and due to partially in- 
complete captured surface portions, there is a need for interpolating the object 
surface at falsified or undefined gaps. A uniform approach solving this problem 
is the geometrical deformable model (GDM). The GDM was first described by 
Miller et al. for the segmentation of volumetric data sets. Basically, a GDM 
is a triangle mesh that dynamically deforms by moving each mesh vertex in 
the direction of steepest descent along the surface of a cost function. The cost 
function integrates all constraints on the shape and position of the mesh into a 
consistent mathematical model. By minimizing the total costs the best solution 
considering all constraints is achieved. 

However, a crucial drawback of this approach is the smoothing of fine details
of the surface, e.g., sharp edges, even in regions where it is defined properly. This
is caused by improper weighting of internal cost terms, which are intended to
preserve the mesh smoothness and topology. We propose a deformation scheme
that moves the vertices under constraint to minimize the external cost term
exclusively. The presented optimization procedure achieves high quality of the
mesh by moving the vertices along two types of forces, a spring force and an
expansion force. The spring force keeps the mesh regular whereas the
expansion force drives the mesh towards the surface.

Fig. 1. (a) Plaster bust of the composer Richard Wagner (b) 3D model reconstructed
from 27 scanned range images

The remainder of this paper is organized as follows. Section 2 gives an
overview of the processing steps. Section 3 discusses the definition of the implicit
surface from multiple registered range images. Section 4 deals with the
generation of the template mesh. Section 5 discusses topological improvements
of the mesh applied during the deformation process. Section 6 presents two
approaches to improve the vertex positions of a given mesh: a fast one and the
proposed GDM approach. In Section 7 results are shown and discussed.

2 Overview 

The processing steps and intermediate representations that yield an accurate,
sparse triangulated surface starting from a set of range images are shown in Fig. 2.
The first step is the registration of the range images, which results in the model
cluster. The cluster holds the information needed to define an implicit surface,
i.e., the range images and their attached transformation matrices. For the
calculation of Euclidean distances to the surface and for higher performance it also
contains a distance transformed volume that assigns every voxel its distance to
the nearest surface point. In a subsequent step, a binary volume is sculptured.
By application of the marching cube algorithm a template triangle mesh is
generated from the volume. By using an octree representation during sculpturing we
are able to generate a volume of arbitrary resolution. Hence it is possible to
construct meshes with the required number of triangles and surface topology. This
template is then adapted to the surface by deformation and subtriangulation.

Fig. 2. Pipeline for adapting the GDM (stages: sculpturing using the distance
information, binary volume, triangulation, template mesh, deformation and
subtriangulation, GDM)



3 Definition of the Implicit Surface

3.1 Distance Function 

Initially, the surface of the object to be reconstructed is given by a number 
of range images that are already registered. The parameter grid of the range 
images is defined by the coordinate system of the scanner, e.g., a cylindrical or 
a perspective system. By interpolating between the grid points the surface can 
be continuously completed. 

Now, we convert the range images to a signed distance function. For the
definition of the distance we define for each range image $i$ a function $g_i(x)$ that
measures the signed projection distance between a point $x$ in space and the
interpolated surface in the range image. The corresponding point in the range image
is found by projecting the point $x$ onto the parameter grid of the range image.
Thus, a positive distance indicates that the point lies between the scanner and
the surface and consequently is visible, whereas a negative distance indicates that
the point lies below the surface and is invisible.

The synthesis of multiple range images is achieved by combining the functions
$g_i(x)$ as shown in equation (1).

$$f(x) = \max_i \{ g_i(x) \} \qquad (1)$$

The implicitly defined surface is given by the zero crossings of $f(x)$. It is now
possible to calculate the intersection between an arbitrary ray and the surface
of the object.
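A one-dimensional toy case may help to see why taking the largest per-view distance yields a consistent inside/outside decision. In the Python sketch below, two hypothetical scanners observe a slab occupying the interval from 1 to 3 along a line, one from each side; the functions g_a and g_b and the function name combined_distance are assumptions of the example, not part of the described system.

```python
def combined_distance(x, projection_distances):
    """Combine per-view signed projection distances by taking their maximum.

    A point is outside the object as soon as one view sees it in free space
    (positive distance), so the largest value decides.
    """
    return max(g(x) for g in projection_distances)


def g_a(x):
    """Hypothetical scanner at x = 0 looking along +x, surface seen at x = 1."""
    return 1.0 - x


def g_b(x):
    """Hypothetical scanner at x = 4 looking along -x, surface seen at x = 3."""
    return x - 3.0
```

With these two views, the combined function is positive on both free sides of the slab, negative inside it, and zero at the two observed surfaces, so its zero crossings delimit the object.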
3.2 Distance Transformation

The distance function discussed in the previous subsection calculates the projec- 
tion distance, which has the same zero crossings as the Euclidean distance. The 
function consequently is suitable for the definition of the implicit surface and 
for the definition of the visibility of a point. However, in order to approximate 
the Euclidean distance from a point to the nearest surface point, we propagate 
distances into space by calculating a distance transformed volume. This has the 
positive side effect that it also reduces the effort of calculating the distance, 
which otherwise depends linearly on the number of range images. In order to 
prevent loss of information the projection distance within the voxels that con- 
tain portions of the surface can be calculated additionally. 

The distance transformed volume is generated by a floating point number
based chamfering distance transformation. The distance is at first defined only
for voxels that contain parts of the object surface. To initialize these
voxels, the projection distance to the center of gravity of the voxel is calculated
and stored in the voxel. By application of a two-pass transformation algorithm,
the distances are propagated successively into the neighborhood. After the
distance transformation is completed, the distances of the invisible voxels are set
to be negative.
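The propagation step can be sketched with a generic two-pass chamfer transform. The Python example below works on a 2D grid rather than a voxel volume, uses the weights 1 and sqrt(2) for axial and diagonal neighbours, and leaves out the final sign assignment; these simplifications, as well as the function name and the seed dictionary, are assumptions of the sketch and not necessarily the masks used by the authors.

```python
import math


def chamfer_distance_transform(seeds, width, height):
    """Two-pass chamfer distance transform on a 2D grid.

    seeds: dict {(x, y): initial distance} for the cells containing the surface.
    Returns a list of rows with the propagated (unsigned) distances.
    """
    inf = math.inf
    d = [[seeds.get((x, y), inf) for x in range(width)] for y in range(height)]
    a, b = 1.0, math.sqrt(2.0)  # axial and diagonal step costs
    forward = [(-1, 0, a), (0, -1, a), (-1, -1, b), (1, -1, b)]
    backward = [(1, 0, a), (0, 1, a), (1, 1, b), (-1, 1, b)]

    def sweep(ys, xs, mask):
        for y in ys:
            for x in xs:
                for dx, dy, w in mask:
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < width and 0 <= ny < height:
                        d[y][x] = min(d[y][x], d[ny][nx] + w)

    # Forward pass: top-left to bottom-right; backward pass: the reverse order.
    sweep(range(height), range(width), forward)
    sweep(range(height - 1, -1, -1), range(width - 1, -1, -1), backward)
    return d
```

Each pass visits every cell once, so the cost is linear in the number of cells and independent of the number of range images, which is the performance benefit mentioned above.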



4 Generating the Template Mesh 

In order to achieve a first approximation of the object surface and to derive the
topology of the object up to the desired level of detail, we use a sculpturing
approach to build an intermediate volumetric model. For each voxel $x$ of the
volume the distance function $f(x)$ can be evaluated, and we thus obtain the
binary decision whether the point belongs to the object. In some cases, there
remain some volumetric regions that do not belong to the object because the
associated voxels lie, e.g., on the back-face of the object, far below one of the
range images defining the front side. After generating the intermediate
volumetric model, the marching cube algorithm with a look-up table that
resolves ambiguous cases can be applied to generate a polygonal representation.
In order to delete wrong volumetric regions, all connected meshes are detected
and all but the largest mesh are deleted. The accuracy of this polygonal mesh is
improved by moving the vertices of the mesh onto the surface implicitly defined
by the registered range images.

5 Improving the Mesh Topology 

To be able to approximate fine details of the surface, our scheme refines the
grid at surface portions with high curvature and removes triangles where the
reconstructed surface is nearly flat. This yields a compact representation
and accelerates operations performed on the mesh. We implemented the following
operations.

5.1 Deletion of Short Edges

By merging those points connected by very short edges and deleting the corre- 
sponding triangles, the number of triangles can be reduced very easily. 
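A minimal Python sketch of this operation on an indexed triangle list is shown below. It assumes that the surviving vertex of a collapsed edge simply keeps its position and that triangles which become degenerate are dropped; the function name and these choices are illustrative assumptions, since the text does not say where the merged vertex is placed.

```python
import math


def collapse_short_edges(vertices, triangles, min_length):
    """Merge the endpoints of edges shorter than min_length.

    vertices: list of 3D points; triangles: list of index triples.
    Returns the triangle list after merging, without degenerate triangles.
    """
    parent = list(range(len(vertices)))

    def find(i):
        # Follow the merge links to the surviving representative vertex.
        while parent[i] != i:
            i = parent[i]
        return i

    for tri in triangles:
        for k in range(3):
            a, b = find(tri[k]), find(tri[(k + 1) % 3])
            if a != b and math.dist(vertices[a], vertices[b]) < min_length:
                parent[b] = a  # vertex a survives and keeps its position (assumption)

    merged = []
    for tri in triangles:
        t = tuple(find(i) for i in tri)
        if len(set(t)) == 3:  # drop triangles that lost an edge
            merged.append(t)
    return merged
```

In a two-triangle example where two vertices are only 0.01 apart, the triangle containing that short edge disappears and the other triangle is kept unchanged.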



5.2 Deletion of Redundant Points 

For this operation a mesh vertex is deleted if its normal differs only
slightly (i.e., within a predefined angle) from the normals of the neighboring
vertices. Subsequently the surrounding polygon is retriangulated by a simple
traversal algorithm.



5.3 Subdivision 

Triangles are split into a number of faces if the distance to the surface of one of the
centers of gravity of the three edges, or of the center of gravity of the triangle, is larger
than a specified threshold value. The new vertices are found as the intersection
of the mesh normal with the implicit surface. After determination of the point
locations, one of the subtriangulation schemes of Fig. 3 is chosen and the split
triangle is replaced by the new triangles.




Fig. 3. Subdivision configurations without symmetric cases 



5.4 Edge Swapping 

The tessellation of a number of vertices is not unique. In order to generate the
globally best triangulation, local optimization can be applied. A triangulation is
globally optimal iff it is locally optimal and a global criterion is improved in each
local optimization step. A triangulation is called locally optimal iff a predefined
criterion is true in each quadrilateral (2 triangles with one common edge).

Fig. 4. Optimization by edge swapping

The criteria for performing the edge swapping (see Fig. 4) are as follows:

— The triangulation of a quadrilateral is optimal iff the smallest angle is greater 
than the smallest angle present in the alternative triangulation of the
quadrilateral (max-min criterion).

— The total area occupied by the triangles should not increase. 

— The approximation error should not increase. 

The edge swapping is activated for all quadrilaterals which do not fulfill the 
criteria mentioned above. This step is repeated for all quadrilaterals until no 
more improvements can be made. 
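The first of these criteria, the max-min angle test, can be sketched in Python for a planar convex quadrilateral as follows. The sketch covers only the angle criterion (the area and approximation-error checks are omitted), works in 2D for simplicity, and uses hypothetical function names.

```python
import math


def _angles(p, q, r):
    """Interior angles (radians) of the 2D triangle p, q, r."""
    def angle(a, b, c):
        # Angle at vertex a between the edges a-b and a-c.
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - a[0], c[1] - a[1])
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        norm = math.hypot(*v1) * math.hypot(*v2)
        return math.acos(max(-1.0, min(1.0, dot / norm)))
    return [angle(p, q, r), angle(q, r, p), angle(r, p, q)]


def should_swap(a, b, c, d):
    """Quadrilateral a-b-c-d (vertices in order), currently split by diagonal a-c.

    Returns True if splitting along b-d instead gives a larger smallest angle.
    Assumes a convex quadrilateral with non-degenerate triangles.
    """
    current = min(_angles(a, b, c) + _angles(a, c, d))
    alternative = min(_angles(a, b, d) + _angles(b, c, d))
    return alternative > current
```

For a flat kite split along its long diagonal, the test recommends swapping to the short diagonal, which replaces two thin triangles by two better-shaped ones; applied to the swapped configuration it returns False, so the procedure does not oscillate on this example.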

6 Improving Vertex Positions 

6.1 A Fast Approach: Smooth and Reproject 

The initial triangulation can be improved by shifting the vertices of the mesh
onto the center of gravity of the surrounding polygon (smoothing). We then
define a ray that runs through the shifted vertex, parallel to the mean normal
of the neighboring triangles. The vertex is then re-projected onto the surface by
finding the nearest surface intersection. We thereby obtain triangles of
approximately equal size and inner angles. The visualization of the object appears to
be greatly improved, as small noise in the vertex coordinates only slightly
influences the surface normal of the triangle. On the other hand, vertices may be
displaced away from small step edges of the surface.

6.2 GDM 

Force Definition The deformation of the GDM from the initial template is
driven by the simulation of two forces. On the one hand, the edges act like springs.
According to equation (2), the spring forces are normalized in order to yield
equilateral triangles.

(Equation (2): spring force, defined over the vertex pairs (i, j) ∈ Vertices.)

Fig. 5. Three stages of the adaption process with (a) 1300 (b) 7500 (c) 16000
triangles

On the other hand, a pressure force defined along the surface normals causes
a smooth deformation of the GDM, as is done for reprojection.

Optimization Procedure The deformation of the GDM is performed itera- 
tively by moving each vertex along the direction of each force. The step size 
for this is chosen individually for both forces at each vertex. Two strategies are 
sensible. 

1. As a first step, move the vertex along the spring force. The step size has 
to be smaller than the distance value of the starting point and shorter than 
the distance to the center of gravity of the surrounding polygon. If a defined 
distance value calculated for all adjacent triangles increases, this step is not 
performed. In a second step, move the vertex along the positive normal 
direction. The step size has to be smaller than or equal to the distance value 
of the starting point. If the distance value of the adjacent triangles increases, 
the negative normal direction is tested. 

2. Since the second step of the above procedure ensures that the
distance of the adjacent triangles to the surface is minimized, the constraint for the
movements along the spring force can be relaxed. This is done by a stochastic
approach. We adapt an acceptance criterion from simulated annealing, which
is shown in equation (3). If the probability $P_{ij}$ is larger than a uniformly
distributed random number in the interval [0 ... 1), the new state is accepted,
otherwise it is rejected.

For the results shown in the next section the latter method has been used.
In equation (3), $T$ indicates the temperature of the system and $h(\cdot)$ indicates the
Euclidean distance to the surface as mentioned in section 3.

$$P_{ij} = \begin{cases} 1 & \text{if } h(x_i) < h(x_j) \\ e^{-(h(x_i) - h(x_j))/T} & \text{if } h(x_i) > h(x_j) \end{cases} \qquad (3)$$
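The acceptance rule can be written as a short Python sketch. It assumes that the first distance belongs to the candidate position and the second to the current one, and it treats the case of equal distances like the exponential branch (which then evaluates to 1); the function names and the injectable random source are choices made for the example.

```python
import math
import random


def acceptance_probability(h_new, h_old, temperature):
    """Probability of accepting a move from distance h_old to distance h_new."""
    if h_new < h_old:
        return 1.0  # improvements are always accepted
    return math.exp(-(h_new - h_old) / temperature)


def accept(h_new, h_old, temperature, rng=random.random):
    """Accept if the probability exceeds a uniform random number in [0, 1)."""
    return acceptance_probability(h_new, h_old, temperature) > rng()
```

At a high temperature nearly every move is accepted, while at a low temperature moves that increase the distance to the surface are almost always rejected, so lowering the temperature gradually tightens the constraint.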
Fig. 6. Approximation error for both vertex optimization methods



In order to calculate the distance of an arbitrary point to the surface, the 
distance transformed volume is sampled at each vertex point and additionally 
at the center of gravity of each triangle. From these sample points the root 
mean square distance is calculated. As can be seen later the minimization of this 
average value results in high quality surface approximation. The optimization 
procedure terminates if the average vertex movement is below a given threshold 
value. The adaption algorithm is summarized as follows: 

Adapt Mesh (T)
{
    loop {
        loop {
            remove short edges (T);
            remove redundant points (T);
            swap edges (T);
        } until no more vertices are removed;
        improve vertex positions (T);
        subdivision (T);
        swap edges (T);
        improve vertex positions (T);
    } until the required accuracy is reached;
}



7 Results 

Results are presented for a bust of the composer Richard Wagner. The bust
was reconstructed from 27 range images. The topology of the bust and a first
approximation of its shape were sculptured in a volume of 23 x 25 x 16 voxels. Fig.
5(a) shows the triangulation generated with the marching cube algorithm. Small
edges have already been eliminated. Afterwards our GDM adaption procedure
was applied to the data set, as presented in Fig. 5(b) and (c).

Simultaneously the root mean square approximation error has been calculated
as the distance value of each triangle center of gravity and of each vertex point during
each subdivision step. As can be seen from Fig. 6, the GDM approach leads to far
lower approximation errors than the faster approach of smooth and reproject.
Abstract. Augmented reality (AR) is a technology by which a user’s
view of the real world is augmented with additional information from a 
computer model. AR applications require a very accurate model of the 
environment (a reality model) to augment the current view seamlessly 
with synthetic information (the virtual model). In this paper, we report 
on the problems we encountered with image data from real exterior con- 
struction sites. We discuss quality requirements for reality models in 
order to be useful in AR applications, and we outline potential further 
needs for reality models. 



1 Introduction 

With AR technology, users can work with and examine real 3D objects while
receiving additional information about those objects or the task at hand.
The virtual objects need to coexist in a physically plausible manner with
the real world: they occlude or are occluded by real objects, they are not able
to move through real objects, and they cast shadows on such objects.

The automatic construction of reality models is a long-standing issue in
computer vision research. In the context of the European CICC project, we
explore applied, pragmatic approaches which are closely tied to the
requirements of realistic pilot applications in exterior construction.
Other approaches towards semi-automatically generating architectural models
from images have been reported by Debevec and by Faugeras.



1.1 Reality Models for Exterior Construction Applications 

Exterior construction applications impose very demanding challenges on the
robustness and usability of evolving AR technologies. Real construction sites are
huge. Information from many views, at close range and from long distances, has
to be integrated. Furthermore, construction environments are not well
structured. Natural objects such as rivers, hills, trees, and also heaps of earth or
construction supplies are scattered around the site. Typically, no exact, detailed 3D
information about such objects exists, making it difficult to generate a precise model
of the site. Even worse, construction sites are in a permanent state of change.
Buildings and landscapes are demolished, and new ones are constructed. People and
construction equipment move about, and the overall conditions depend on the
weather and seasons. Reality modelling and reality tracking are thus very complex
and demanding tasks. AR applications therefore need to identify suitable simplified
approaches for generating and dynamically maintaining appropriate models of
the real environment.



1.2 AR Applications in Exterior Construction Projects 

The first approach augments video sequences of large outdoor sceneries with
detailed models of prestigious new architectures, such as TV towers and bridges
that will be built to ring in the new millennium (see Figure 1a). Since such
video sequences are very complex, we currently pre-record the sequences and
employ off-line, interactive calibration techniques to determine camera positions.
Given all calibrations, the augmentation of the images with the virtual object is
performed live, i.e., the virtual model can be altered and transformed while it is
being seen in the video sequence.

The second approach operates on live video streams, calibrating and
augmenting images as they come in. To achieve robust real-time performance, we
need to use simplified, “engineered” scenes.




(a) Virtual bridge across a real river. (b) Virtual wall and grid in a real room.

Fig. 1. Interactive vs. automatic video augmentation.



In particular, we place highly visible markers at precisely measured locations
to aid the tracking process (Figure 1b). Such use of special markers is becoming
common practice.



2 Reality Models

For optical camera calibration and to enforce physical interaction constraints
between real and virtual objects, augmented reality systems need to have a
precise description of the physical scene. Reality models don’t need to be as
complex as, for example, Virtual Reality (VR) models. VR models are expected
to synthetically provide a realistic immersive impression of reality. Thus, the
description of photometric reflection properties and material textures is crucial.
AR, on the other hand, can rely on live optical input to provide a very high sense
of realism. However, AR reality models have to be much more precise than VR
models since users have an immediate quantitative appreciation of the quality
of the integration between reality and augmentations.



2.1 Use of Existing Models 

The most straightforward approach to acquiring 3D scene descriptions is to use
existing geometric models, such as CAD data, output from GIS systems, and
maps.





(a) Building under construction (b) Virtual model 

Fig. 2. Building under construction at the Expo 98 site in Lisbon. 



Once a building has entered the construction phase, the virtual model of the
building itself can begin serving as a reality model, as shown in Figure 2.

Yet, existing models cannot always be used for AR. The data in CAD models doesn’t
necessarily coincide with discernible features in images. Furthermore, the models
often don’t describe the evolving reality of a construction site in enough detail.



2.2 Manual Approach 

The manual approach involves obtaining 3D measurements within the real world,
using databases and physical instruments. The 3D data points are entered into a
small model. The approach works well when only very sparse reality models are
needed, but it cannot be used to generate elaborate descriptions of the real world.
Figure 3a shows a reality model of our “tracking laboratory” generated in this way, a
room with several carefully measured targets on its walls.




(a) Laboratory setup. 



(b) At a real construction site. 



Fig. 3. Example and use of manually created reality models. 



Figure 3b shows the use of a similarly simple reality model of a small area at
the Bluewater construction site in Kent, UK. The model data was collected with
a 3D laser pointer that was attached to a differential GPS system. The location
of the upper right corner of each black square was determined by orienting the
3D laser pointer at the square, thus yielding the orientation and distance
(time-of-flight) between the square and the pointer.
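
As a rough illustration of this measurement, the position of a target can be computed from the pointer position, the pointing direction and the time-of-flight distance. The sketch below is not taken from the CICC system; it assumes a local east-north-up frame in metres, an azimuth measured clockwise from north, an elevation measured upwards from the horizontal, and a round-trip time measurement. All function and parameter names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tof_distance(round_trip_seconds: float) -> float:
    """Distance to the target from a round-trip time-of-flight measurement."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


def target_position(pointer_enu, azimuth_deg, elevation_deg, distance_m):
    """Target position in the local east-north-up frame of the pointer.

    pointer_enu   -- (east, north, up) of the laser pointer, e.g. from differential GPS
    azimuth_deg   -- pointing direction, clockwise from north (assumed convention)
    elevation_deg -- pointing direction above the horizontal (assumed convention)
    distance_m    -- measured distance between pointer and target
    """
    azimuth = math.radians(azimuth_deg)
    elevation = math.radians(elevation_deg)
    horizontal = distance_m * math.cos(elevation)
    east = pointer_enu[0] + horizontal * math.sin(azimuth)
    north = pointer_enu[1] + horizontal * math.cos(azimuth)
    up = pointer_enu[2] + distance_m * math.sin(elevation)
    return (east, north, up)
```

Each target still requires one manual aiming step, which is why the approach scales poorly to thousands of points.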

Yet, the approach is prohibitively time-consuming if thousands of points
are to be measured this way to generate suitable surface approximations for
occlusion handling. Furthermore, the approach depends upon the availability
of professionals and special equipment. Thus, models cannot be expected to be
obtainable on short notice.



2.3 Interactive Approach 

In the European CICC project, we build a very sparse initial model of a landscape 
from external information such as maps and geodesic measurements. Our system 
helps us to interactively extend this model by superimposing it on a calibrated 
image. Models of new objects can then be entered into the model, using their two- 
dimensional position in a map and estimating their height from their alignment 
in several images. 

From this model, we generate an initial camera calibration for a few site
photos, interactively indicating how features in the image relate to the model
(see Figure 4a).

Once an image has been successfully calibrated, the model is overlaid on the
image, showing good alignment of the image features with the model features.
Figure 4b illustrates the calibrated insertion of a new house into the model, using
an initial guess for the height of the building that is much too large.



(a) Initial model. (b) Interactively extended model.

Fig. 4. Interactively created and extended reality model of the city of London.



2.4 Towards Automatically Generated Models 

Computer vision techniques are designed to automatically acquire
three-dimensional scene descriptions from image data. Much research is currently under way,
exploring various schemes to optically reconstruct a scene from multiple images,
such as structure from motion, (extended) stereo vision,
and photogrammetric techniques.

In the context of the European project CUMULI, we explore to what extent
automatically generated scene models can support AR and VR applications. In
collaboration with INRIA and Lund University, we are developing and testing
tools which exploit epipolar relationships between features in several images,
geometric constraints on architectural structures, as well as city maps, to
determine a set of progressively more precise projective, affine and finally Euclidean
properties of points in the three-dimensional scene.
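
To make the notion of an epipolar relationship concrete, the following sketch checks the epipolar constraint for a pair of image points. It is a textbook illustration, not the CUMULI tools: it assumes two calibrated cameras with known relative pose (X2 = R X1 + t) and image points in normalized homogeneous coordinates, so that the essential matrix E = [t]x R can be used. In the uncalibrated case the fundamental matrix plays the same role. All function names are introduced here for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def cross_matrix(t: Sequence[float]) -> Matrix:
    """Skew-symmetric matrix [t]x such that [t]x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def essential_matrix(rotation: Matrix, translation: Sequence[float]) -> Matrix:
    """E = [t]x R for a second camera whose coordinates are X2 = R X1 + t."""
    return matmul(cross_matrix(translation), rotation)


def epipolar_residual(x1: Sequence[float], x2: Sequence[float], essential: Matrix) -> float:
    """Value of x2^T E x1; zero when x1 and x2 can depict the same 3D point."""
    e_x1 = [sum(essential[r][c] * x1[c] for c in range(3)) for r in range(3)]
    return sum(x2[r] * e_x1[r] for r in range(3))


def normalize(point: Sequence[float]) -> List[float]:
    """Project a 3D point in camera coordinates to homogeneous image coordinates."""
    return [point[0] / point[2], point[1] / point[2], 1.0]
```

A candidate feature match whose residual is far from zero violates the epipolar geometry and can be rejected before any 3D reconstruction is attempted.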

Figure 5 shows a reconstructed model of the Arcades of Valbonne. Figure
5a shows the reconstructed geometric model. In Figure 5b, the model has been
enhanced by mapping textures from the original image data onto the surfaces.
Figure 5c illustrates how a photo of the area can be augmented with synthetic
objects, such as a Ferrari, once the images have been analyzed and calibrated.

2.5 Range Data Models 

As an alternative to motion-based scene recognition, the RESOLV project uses
three-dimensional range sensors to conduct a 3D survey of a building. The environment
is scanned from a number of capture positions and reconstructed into a model,
unifying measurements from all viewing positions (Figure 6). Surfaces are
recognized by processing the range data and are textured from camera images.

2.6 Use of Reality Models for Camera Calibration 

(a) Geometric model. (b) Enhanced with texture maps.

(c) 2D picture of the plaza, augmented with a Ferrari.

Fig. 5. Automatically generated model of the arcades of Valbonne.



Precise camera calibration is a key issue in AR. Calibration algorithms are
inherently sensitive to noise and to specific alignments of features in the reality
model. For example, houses in cityscapes tend to be aligned along a road or river.
Many target features are thus approximately coplanar - considering a distance
of several hundred meters between the camera and the houses - and cannot
supply good three-dimensional depth cues for camera calibration. In our work, we
emphasize pragmatic concepts to cope with such real problems.

— Reality models should use targets that are well spread throughout a
three-dimensional volume. For example, the inclusion of distant high rises and power poles in a
model can greatly stabilize calibration results. The targets also need to be
easily detectable and precisely locatable in image data.

— To help users correctly position image features, our system automatically
determines which image feature currently has the largest influence on a
calibration misalignment. By moving that feature by one pixel up, down,
left, or right, a new calibration generates a much smaller mismatch between
image features and projected scene features.

— We use as much external information as is available, such as known
internal camera parameters. Our algorithm can be further constrained when
the approximate camera location and orientation are known from other
tracking devices.



(a) (b)

Fig. 6. (a) RESOLV trolley. (b) Automatically generated model of part of the
interior of the Royal Institute of Chartered Surveyors, London.
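
The second guideline of Section 2.6, finding the image feature whose one-pixel displacement most reduces the calibration mismatch, can be sketched as a simple exhaustive search. The sketch is an assumption about how such a search might look, not the system's actual code: the calibration itself is hidden behind a caller-supplied function `residual_fn`, which is assumed to recalibrate from the given feature positions and return the remaining mismatch. All names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]

# One-pixel moves: up, down, left, right (image y is assumed to grow downwards).
ONE_PIXEL_SHIFTS = [(0, -1), (0, 1), (-1, 0), (1, 0)]


def most_influential_feature(
    features: Sequence[Point],
    residual_fn: Callable[[List[Point]], float],
) -> Tuple[Optional[int], Optional[Tuple[int, int]], float]:
    """Return (feature index, one-pixel shift, new residual) of the best single move.

    If no single move lowers the residual, the index and shift are None and the
    residual of the unmodified features is returned.
    """
    best_index, best_shift = None, None
    best_residual = residual_fn(list(features))
    for index, (x, y) in enumerate(features):
        for dx, dy in ONE_PIXEL_SHIFTS:
            trial = list(features)
            trial[index] = (x + dx, y + dy)
            residual = residual_fn(trial)  # recalibrate with the moved feature
            if residual < best_residual:
                best_index, best_shift, best_residual = index, (dx, dy), residual
    return best_index, best_shift, best_residual
```

The search costs four recalibrations per feature, which is acceptable for the handful of features a user marks interactively.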



3 Augmenting Reality 

With AR, virtual geometric objects can be integrated into the real
environment during all phases of the life cycle of a building. Before the construction
project is started, AR can support marketing and design activities to help the
customer visualize the new object in the environment (Figure 7). During
construction, AR can help evaluate whether the building is constructed according
to its design (Figure 8).




(a) Original scene. (b) Augmented with planned footbridge.

Fig. 7. Side view of a new footbridge, planned to be built across the river Wear
in Sunderland, UK.

Fig. 8. A virtual wall at a real construction site.

(a) Original image. (b) X-Ray view into the wall.

Fig. 9. Seeing the piping in the wall.



After the construction is completed, maintenance and repair tasks benefit
from seeing hidden structures in or behind walls (Figure 9).



3.1 Occlusion Handling Using Geometric Reality Models or Depth 
Maps 

Occlusions between real and virtual objects can be computed by geometric
rendering hardware by first drawing the reality model transparently and then
rendering the virtual objects. Other mixing approaches initialize the Z-buffer from
depth maps obtained with a laser scanner or stereo computer vision. As a
result, the user sees a picture on the monitor that blends virtual objects with
live video, while respecting 3D occlusion relationships between real and virtual

objects (Figure El- 
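
The depth test that the rendering hardware performs can be emulated in software to show the principle. The following is a minimal sketch, assuming that a depth value for the real scene (from the reality model or a depth map) and a rendered color and depth for the virtual objects are available per pixel. Images are plain nested lists, and the function name is introduced here for illustration.

```python
def composite_with_depth(video, real_depth, virtual_color, virtual_depth):
    """Blend virtual objects into a video frame while respecting occlusions.

    video         -- 2D list of pixel values of the live video frame
    real_depth    -- 2D list of depths of the real scene at each pixel
    virtual_color -- 2D list of rendered virtual pixel values, None where empty
    virtual_depth -- 2D list of depths of the rendered virtual objects
    """
    output = [row[:] for row in video]  # start from the unmodified video frame
    for v, row in enumerate(video):
        for u in range(len(row)):
            color = virtual_color[v][u]
            # A virtual pixel is drawn only where it is closer than the real scene.
            if color is not None and virtual_depth[v][u] < real_depth[v][u]:
                output[v][u] = color
    return output
```

Initializing a hardware Z-buffer with `real_depth` before rendering the virtual objects achieves the same result without the explicit loop.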
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Fig. 10. Virtual London bridge reflecting in the real water (partially occluded 
by the houses at the far side of the Thames) . 



3.2 Simulation of Illumination Effects and Physical Constraints 

Objects in the real world not only determine their own shading, they also have
an influence on the appearance of other, distant objects by means of shadows
and reflections.

With the availability of reality models, the geometry of shadows cast by
virtual objects onto real ones can be computed. Reflections are a more
difficult topic that can be solved for many useful special cases, such as reflections
of virtual objects in planar real mirrors (Figure 10). Reflective virtual objects
are difficult to handle, as in general they would have to reflect things from
the surrounding environment that are not visible in the image. Rendering the
reflections from the real environment onto virtual objects requires the availability
of a high-quality reality model.
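
For the simplest case, a point light source and a planar real surface taken from the reality model, the shadow geometry reduces to intersecting a ray with a plane. The sketch below illustrates this special case only. The point light, the plane representation n . x = d and all names are assumptions made for the example.

```python
def dot(a, b):
    """Scalar product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_shadow(point, light, plane_normal, plane_offset):
    """Shadow of a virtual point on the real plane n . x = d, or None if there is none.

    The shadow is the intersection of the ray from the light through the point
    with the plane, provided the plane lies beyond the point as seen from the light.
    """
    direction = tuple(p - l for p, l in zip(point, light))
    denominator = dot(plane_normal, direction)
    if abs(denominator) < 1e-12:
        return None  # ray runs parallel to the plane
    s = (plane_offset - dot(plane_normal, light)) / denominator
    if s < 1.0:
        return None  # plane is in front of the point or behind the light
    return tuple(l + s * d for l, d in zip(light, direction))
```

Projecting every vertex of a virtual object in this way yields the outline of its shadow on the real surface.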

For an augmented world to be realistic, the virtual objects have to interact
with the real world not only optically but also physically. This applies to virtual
objects when animated or manipulated by the user. For example, a virtual chair
shouldn’t go through walls when it is moved, and it should be subject to
gravity. Given a reality model, this behavior can be achieved using collision
detection and avoidance systems that are known from Virtual Reality systems.
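
A very simple form of collision avoidance can be sketched with axis-aligned bounding boxes: a requested move of a virtual object is rejected whenever the moved object would intersect a box of the reality model. This is a deliberately coarse illustration under the assumption that both virtual objects and the reality model are approximated by boxes. It does not model gravity, and the names `Box` and `try_move` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its minimum and maximum corners."""
    lo: Vec3
    hi: Vec3


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the interiors of the two boxes intersect."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))


def translate(box: Box, delta: Vec3) -> Box:
    """Box moved by the given displacement."""
    return Box(tuple(box.lo[i] + delta[i] for i in range(3)),
               tuple(box.hi[i] + delta[i] for i in range(3)))


def try_move(virtual_box: Box, delta: Vec3, reality_boxes: Sequence[Box]) -> Box:
    """Apply the move unless it would push the virtual object into a real one."""
    moved = translate(virtual_box, delta)
    if any(boxes_overlap(moved, real) for real in reality_boxes):
        return virtual_box  # collision: keep the previous position
    return moved
```

Even this coarse test prevents the most disturbing artifact, a virtual object passing through a real wall.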

These two laws make up the most important physical constraints. A full
physical simulation including more aspects of the interaction between real and
virtual objects, such as elastic behavior and friction, would be desirable. For
off-line applications this is possible if enough information about the virtual objects
and a sufficiently complete reality model are available. For real-time applications most
simulation systems are not fast enough. Yet, even simple implementations of the
above rules will make the system much more realistic.



4 Future Work

So far, we have used reality models for camera calibration, occlusion handling 
and a simple analysis of light reflections from virtual onto real objects. Yet, the 
concept of generating, using and updating reality models within AR applications 
goes far beyond this. We will now briefly allude to two other relevant aspects of 
AR that we expect to become important in the future. 

4.1 Leaving Reality Behind 

Augmented reality and virtual reality are not two discrete alternatives but rather 
part of a spectrum of mixed realities, with full virtual reality on one end and
full physical reality on the other. Augmented Reality is in the middle, combining 
the best of both worlds. But sometimes it might be desirable to lean more in 
one direction or the other. 

Since registered augmented reality by concept needs real images, its freedom 
of movement is limited to the places where an image recording device (possibly 
a human eye) can go. Virtual reality on the other hand allows complete freedom 
of movement, as computer generated images can be generated for every possible 
viewpoint. It may sometimes be desirable to leave the augmented reality be- 
hind and switch into the virtual reality to take a look from a point where it is 
physically impossible to go, e.g. from above. 

When leaving reality behind, the view has to be constructed entirely from 
synthetic information, i.e., from the reality model plus new virtual objects. A 
very promising area of current computer graphics research in this direction is
image-based rendering, which strives towards generating images
from new viewpoints given some images from other viewpoints. A future system
might employ a camera to record images while viewing the augmented scene and
use them to incrementally refine the reality model. The ever-improving reality
model allows the system to render increasingly realistic synthetic images from 
places that have not been visited by the user. 

4.2 Diminishing Reality 

Many construction projects require that existing structures be removed before 
new ones are built. Thus, just as important as augmenting reality is technology 
to diminish it. 

Figure 11a shows one of several pictures of TV towers on Monte Pedroso near
Santiago de Compostela, Spain. Prior to augmenting the image with a model
of a new TV tower, the existing towers need to be removed (Figure 11b). To
this end, a part of the sky has to be extrapolated into the area showing the
TV towers and the barracks. Then the new tower can be put into place (Figure
11c).

We currently use interactive 2D tools to erase old structures from images
(Figure 11). This approach can only be used for static, individual photos, but
not for video sequences from a live, dynamically moving camera.



(a) Original image. (b) Diminished reality. (c) Augmented reality.

Fig. 11. Monte Pedroso near Santiago de Compostela, Spain.



In principle, the problem of diminishing reality consists of two phases. First,
the buildings to be removed have to be identified in the image. When such structures are
well represented in a reality model, they can be located by projecting the model
into the image according to the current camera calibration.
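
The first phase can be illustrated by projecting the vertices of the outdated structure into the image and taking the enclosing pixel rectangle. The sketch assumes a simple pinhole camera with known rotation, translation, focal length in pixels and principal point. The bounding rectangle is only a coarse stand-in for rasterizing the full model, and all names are introduced here for illustration.

```python
import math


def project_point(point, rotation, translation, focal_px, principal_point):
    """Pinhole projection of a world point; returns None if it lies behind the camera."""
    cam = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
           for r in range(3)]
    if cam[2] <= 0.0:
        return None
    return (focal_px * cam[0] / cam[2] + principal_point[0],
            focal_px * cam[1] / cam[2] + principal_point[1])


def outdated_region(vertices, rotation, translation, focal_px, principal_point, image_size):
    """Pixel rectangle (u_min, v_min, u_max, v_max) covering the projected vertices.

    Returns None if no vertex is in front of the camera or if the projection
    falls entirely outside the image.
    """
    projected = [project_point(v, rotation, translation, focal_px, principal_point)
                 for v in vertices]
    visible = [p for p in projected if p is not None]
    if not visible:
        return None
    width, height = image_size
    u_min = max(0, math.floor(min(p[0] for p in visible)))
    v_min = max(0, math.floor(min(p[1] for p in visible)))
    u_max = min(width - 1, math.ceil(max(p[0] for p in visible)))
    v_max = min(height - 1, math.ceil(max(p[1] for p in visible)))
    if u_min > u_max or v_min > v_max:
        return None
    return (u_min, v_min, u_max, v_max)
```

The pixels inside the returned rectangle are the candidates that the second phase has to replace.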

The outdated image pixels then need to be replaced with new pixels. There
is no general solution to this problem since we cannot know what a dynamically
changing world looks like behind an object at any specific instant in time - unless
another camera can see the occluded area. Yet, some heuristics can be used
to solve the problem for various realistic scenarios. We can use morphological
operators to extrapolate properties of surrounding “intact” areas (e.g., a cloudy
sky) into outdated areas. Furthermore, when a building is to be removed from
a densely populated area in a city, static snapshots of the buildings
behind it could be taken and integrated into the reality model to be mapped as
textures into the appropriate spaces of the current image. First results of such
“X-ray vision” capabilities are shown in Figure 9b.

For video loops of a dynamically changing world, computer vision techniques
can be used to suitably merge older image data with the new image. Faugeras et
al. have shown that soccer players can be erased from video footage when they
occlude advertisement banners. For a static camera, changes of individual pixels
can be analyzed over time, determining their statistical dependence on camera
noise. When significant changes (due to a mobile person occluding the static
background) are detected, “historic” pixel data can replace the current values.
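
The idea of replacing significantly changed pixels by their history can be sketched for a static camera and grayscale frames as follows. This is a generic running-average background model, not the method of Faugeras et al.: it assumes a known, constant camera noise level, and the class name, the threshold of three standard deviations and the learning rate are illustrative assumptions.

```python
class PixelHistory:
    """Per-pixel background estimate for a static camera (grayscale frames)."""

    def __init__(self, first_frame, noise_sigma=2.0, threshold=3.0, learning_rate=0.05):
        self.background = [[float(value) for value in row] for row in first_frame]
        self.noise_sigma = noise_sigma      # assumed camera noise (gray levels)
        self.threshold = threshold          # significance level in units of sigma
        self.learning_rate = learning_rate  # how fast the background adapts

    def clean(self, frame):
        """Return the frame with occluded pixels replaced by historic values."""
        limit = self.threshold * self.noise_sigma
        output = []
        for v, row in enumerate(frame):
            out_row = []
            for u, value in enumerate(row):
                background = self.background[v][u]
                if abs(value - background) > limit:
                    # Significant change: an occluder hides the static background.
                    out_row.append(background)
                else:
                    # Change explained by noise: accept pixel and update history.
                    self.background[v][u] = background + self.learning_rate * (value - background)
                    out_row.append(value)
            output.append(out_row)
        return output
```

Because occluded pixels do not update the history, an object that stops moving is not absorbed into the background in this sketch.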

We use geometric constraints to compute pixelwise correspondences between
regions in several images that outline a particular object (Figure 12). From such
correspondences, we can trace specific points on the object across all images and
we can decide in which images it is visible or occluded. Accordingly, occluded
pixels can be replaced by visible ones, effectively removing the occluding object
from the image. In more general schemes using mobile cameras, such techniques
can lead towards incremental techniques to diminish reality. While moving about
in the scene, users and cameras see parts of the background objects. When
properly remembered and integrated into a three-dimensional model of the scene,
such “old” image data can be reused to diminish newer images, thus increasingly
effacing outdated objects from the scene as the user moves about.



Fig. 12. Automatic removal of a lego block.



5 Conclusions 

Reality models are a crucial aspect of Augmented Reality work. Implicitly or 
explicitly, every AR application relies on knowledge about its surroundings in 
order to augment the reality appropriately. As reported in this paper, manual 
and semi-automatic techniques are commonly used to set up the models - a
time-consuming and complex task. Tools to automatically generate such models, e.g.,
via computer vision algorithms, will greatly improve the quality and flexibility of
AR applications. As a first step, the off-line generation of static scene descriptions 
will suffice. Yet, the long-term goal must be to automatically update and improve 
the models in real time while the application takes its course. At that point, AR 
applications will be able to deal with moving people and objects in the scene, 
such as objects that are being disassembled as part of the task. Users will also be 
able to temporarily leave reality behind to explore aspects otherwise unreached. 
Furthermore, they will not only be able to augment reality but also to diminish 
it, virtually removing objects by showing what is behind them - according to 
the current 3D information in the reality model.



Sources of Graphical Material

Sunderland Newcastle, UK (with Ove Arup and Partners): Figures 1a and 7
show a picture of the river Wear, Sunderland Newcastle, UK. The bridge
model was provided by Sir Norman Foster and Partners.

Thames river, London, UK (with Ove Arup and Partners): Figures 4a,b
and 10 show pictures of the river Thames, London, near St. Paul’s Cathedral. The
3D model of a designed millennium footbridge was provided by Sir Norman Foster
and Partners. The model was acquired for the CICC project by Ove Arup and
Partners.

Bluewater Kent, UK (with Bovis and Trimble Navigation Limited): Fig- 
ures Eb andlSb show a picture from a video sequence. 

Santiago de Compostela, Spain (with Ove Arup and Partners): Figures
11a,b,c show a picture of Monte Pedroso near Santiago de Compostela, Spain.
The model of the TV tower was provided by Sir Norman Foster and Partners.

Gmunder StraCe, Munich, Germany (with Philipp Holzmann AG, Ger- 
many): Figures^ andUi,b show indoor pictures of a bathroom under construc- 
tion. Figure[5^ shows an outdoor snapshot. 

Valbonne, France (with INRIA Sophia-Antipolis): Figures 5a,b,c show a
picture and a reconstructed model of the Arcades in Valbonne, France.

Royal Institute of Chartered Surveyors, London, UK (work by U. Leeds,
JRC and BICC): Figure 6 shows pictures of the RESOLV trolley and the
reconstructed model of the Royal Institute of Chartered Surveyors. Courtesy of the
RESOLV project.



Abstract. In this paper we present a new interactive collaborative
Augmented Reality system and demonstrate its functionality in a
collaborative design application. After a short investigation of the requirements
of a collaborative AR system we introduce the main components and
key features of our implementation. An easy and intuitive method for
augmenting video frames based on available 3D geometry is introduced.
Our system is able to handle object interactions such as mutual
occlusion of real and virtual objects or collision between objects of both types.
’Reality’ is transmitted as a video stream to all partners in the
collaboration. A compression technique is used to compress the PAL-size color video
frames before transmission. Our system has been successfully tested in
a test environment between Darmstadt and Rostock.



1 Introduction 

Augmented Reality (AR) is similar to the widely known Virtual Reality (VR). In
VR a user is completely immersed in a synthetic, computer-generated
environment and cannot see the real world around him. In Augmented Reality a user is
able to see his real environment, and additional virtual objects are superimposed
upon the user’s view. Augmented Reality therefore enriches reality rather than
completely replacing it. A user interacts with the real world in the usual, natural
way and simultaneously employs the computer either to interact with virtual
objects or to obtain additional information. As a result, compared with VR,
AR requires less computer performance. Augmented Reality has been explored
in several scenarios. In the field of medicine, surgery can be trained, assisted
and guided by superimposed, registered views of medical data (e.g. MRI, CT
or ultrasound). Applications of Augmented Reality in the field of
assembly, maintenance, and repair have been demonstrated [7, 8, 16]. Augmented
Reality can also be used for annotation, visualization or planning
purposes, e.g. urban planning. Ahlers et al. demonstrated the use of Augmented
Reality in the field of design.

In the field of design, Augmented Reality is a
very promising technique. Visualization of products and new design ideas can
be greatly improved by Augmented Reality. Nowadays the convincing
presentation of new products is a lengthy and often very expensive task because realistic
physical models or mock-ups have to be built. The use of computer-generated
models in a VR scenario is an alternative to mock-ups or physical models. These
models are then presented in virtual environments and users are allowed to
explore them by flying over or walking through the models. In addition to that,
Augmented Reality offers the advantage of presenting these computer models in the
real environment, which can be given as a video captured at the location where
the real object will be placed later on. Users get a much clearer understanding
and impression of the model within its intended real environment.

In this paper
we present an Augmented Reality system for design applications. Designing is
often the task of a group and not only of a single person. Our system provides
support for collaborative work of a group of users. A video stream of a real
environment is distributed to all participants in the collaboration and functionalities
are provided to collaboratively augment the video by adding and manipulating
virtual objects. Interactions of real and virtual objects are taken into account
by handling of occlusion and collision events.

2 System Requirements Analysis 

2.1 Display Technology 

Commonly used output devices in Augmented Reality systems are head-mounted
displays (HMDs), either see-through or non-see-through, and computer
monitors. See-through HMDs in combination with electromagnetic trackers have
the advantage that one does not have to deal with grabbing and displaying
video streams and only has to display virtual information. These systems are
usually fast and can work in real time. However, a significant disadvantage of
see-through HMDs becomes apparent when mutual occlusion of virtual and real
objects occurs. See-through HMDs cannot completely block off light from real
objects at places where they are occluded by virtual ones, so virtual objects
always appear semi-transparent and are blended over the real world rather than
into it. System latencies are critical, since synchronization of real world and
virtual objects is a major problem. Virtual objects are delayed in movement when
the user moves and appear to float over the real world rather than to be part
of it.

Non-see-through HMDs display the real world as a video stream captured
by one or usually two (for stereo) cameras attached to the HMD. The use of
non-see-through HMDs requires the additional effort of grabbing and displaying video
streams. Compared to see-through HMDs, non-see-through HMDs therefore
require a computer system with higher rendering performance. Synchronization
is less problematic when using non-see-through HMDs since displaying of the
augmented video frames can be delayed. “Floating” effects are only caused by
registration errors. Mutual occlusion of real and virtual objects becomes
possible since, e.g., depth information about the real world can be used when merging
virtual objects and video.

In collaborative AR applications we must distinguish
between two general scenarios. In scenario one, all users are present at the same
location, see virtual objects from different points and possibly interact with
them. In the second scenario, only one or a few users are present at the same
location whereas other users are present at remote locations. In scenario one
see-through HMDs can be used since the real world is directly available to all
users. In the second scenario the real world is directly available only to a few users
and has to be transmitted to the other users. Non-see-through HMDs or computer
monitors must therefore be used as output devices. In our collaborative design
application, where we assume that not all users are present at the same location,
computer monitors are used as output devices. A more comprehensive overview
of display devices for Augmented Reality systems is given by Azuma and by
Rolland et al.

2.2 Tracking 

Precise camera calibration and tracking is one of the most substantial problems 
in Augmented Reality. The demand of registration accuracy is much higher than 
in Virtual Reality. The reason for this demand lies in the nature of AR, in the 
combination of visual information with visual information. The human eye can 
easily detect offsets of a single pixel between e.g. a real object and its overlaid 
rendered model. To understand this, we have to look at the anatomy of the human
retina. Its central part, the so-called fovea, has a resolution of about 0.5
minute of arc, which means that in this area the human eye can resolve alternating
brightness bands that subtend one minute of arc. Most Augmented Reality systems
use electromagnetic or hybrid tracking technology for tracking the movements of
camera, user or objects. Vision-based tracking has been used by State et al. in a
hybrid system and by Koller et al. [12]. Electromagnetic tracking devices, which
are also commonly used in VR, have an orientation accuracy of 0.15° (Polhemus
Corporation 1996) and are therefore not able to track with the accuracy that is
needed in AR [15]. Vision-based tracking is a very accurate but time-consuming
tracking method. It fails when the tracking marks are occluded or outside the
field of view of the camera. Hybrid systems that combine the advantages of several
tracking technologies may be most suitable for AR systems. Besides high accuracy
and stability, tracking techniques must offer feedback to measure and correct
tracking errors. Standard electromagnetic tracking systems represent an
"open-loop" controller system with no feedback of tracking errors to the system.
There is no correlation between the tracking signal and the reference signal
(video). With this kind of system it is difficult to detect and correct tracking
errors. In vision-based tracking systems the tracking signal is identical to the
reference signal, and tracking errors can easily be detected, corrected and, if
desired, fed back to the tracking system. Building a "closed-loop" controller
system is therefore easier when using vision-based tracking. In our collaborative
AR applications we currently use vision-based tracking that employs tracking marks
and Kalman filtering for fast and precise tracking.
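
The gap between the two accuracy figures quoted above can be made concrete with a
short calculation. The sketch below converts the quoted tracker accuracy to minutes
of arc, reading the value 0.15 as degrees, and estimates the resulting pixel offset
in the image. The 40 degree field of view and the 640 pixel image width are assumed
values chosen purely for illustration; they are not taken from the system described
here.

```python
ARCMIN_PER_DEGREE = 60.0


def degrees_to_arcmin(angle_deg):
    """Convert an angle from degrees to minutes of arc."""
    return angle_deg * ARCMIN_PER_DEGREE


def angular_error_in_pixels(error_deg, fov_deg, image_width_px):
    """Approximate pixel offset caused by an orientation error.

    Assumes the pixels are spread uniformly over the horizontal field of
    view, which is a small-angle approximation.
    """
    return error_deg * image_width_px / fov_deg


# Tracker orientation accuracy quoted in the text, read as degrees.
tracker_error_arcmin = degrees_to_arcmin(0.15)

# Hypothetical camera: 40 degree horizontal field of view, 640 pixels wide.
pixel_offset = angular_error_in_pixels(0.15, fov_deg=40.0, image_width_px=640)

print(tracker_error_arcmin, "arc minutes of tracker error")
print(pixel_offset, "pixels of misregistration (assumed camera)")
```

Under these assumptions the tracker error is several times larger than both the
one minute of arc the eye resolves and the single-pixel offsets mentioned above.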

2.3 Networking

Distributing Augmented Reality differs somewhat from distributing e.g. VR. In AR
not only virtuality and user interactions need to be distributed; reality too, in
the form of e.g. a video stream, has to be transferred to all participants in the
collaboration. Distributing video streams means distributing huge amounts of
data. The usual approach of transferring only small or low-resolution images
cannot be accepted in AR, since the "illusion" of reality is dominated by the
view (video) of the real world. The network architecture used in a distributed AR
application must therefore be able to handle a significant amount of data.

There are several network architectures that are used in VR applications.
Basically they can be reduced to three approaches: peer-to-peer, client-server
and distributed. In the peer-to-peer model each user is connected to all other
users by a one-to-one connection, and interactions have to be transferred to each
user separately. In an AR application such an approach is unacceptable, since for
n users the sender would have to transmit the video stream n - 1 times, which
would decrease the performance of the application tremendously. However, this
type of model is useful for creating private connections between certain users,
e.g. for distributing restricted information or for audio channels. In the
distributed approach, the state of the world is distributed amongst all users.
This model is very complex, and it is difficult to maintain data consistency and
coherence. The third and most commonly used approach is the client-server
approach, where data from a client are sent to the server and then distributed to
some or all other clients. This approach is ideal for filter operations, which can
reduce the network load. Its drawback is that the server becomes a bottleneck
when the number of users increases.

Another problem in distributed applications is the location of data. This point
needs consideration since it is very important for the management mechanisms that
have to be employed in order to maintain scene consistency. The most common data
models are the totally replicated database, the shared central database and the
shared distributed database. In the last two models the database is stored just
once, either at a central location or spread over several locations. In the case
of a totally replicated database each user has his own copy of the database.
Shared databases make it easy to maintain scene consistency, whereas a totally
replicated database creates more problems concerning data consistency due to
possible transfer losses or system latencies. In AR applications a totally
replicated database can be used when the users have high-performance computers,
since in this case rendering is done locally and therefore all data have to be
available locally. When using low-performance computers a central database may be
more appropriate, since rendering can be done by a central high-performance
computer and only the rendered frames are sent to the user. With respect to fast
feedback for user interactions with virtual objects, a replicated database offers
significant advantages. In our application we use a client-server architecture
with a totally replicated database, since we assume only a limited number of users
in the collaboration, all using computers with sufficiently high rendering
performance.
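
The difference between the peer-to-peer and the client-server model for the video
stream can be illustrated by counting transmissions per video frame. The following
is a minimal sketch; the function name and the architecture labels are chosen for
illustration and are not part of the described system.

```python
def video_sends_per_frame(num_users, architecture):
    """Return (sends by the camera client, sends by the server) for one frame.

    num_users counts all participants, including the one who owns the camera.
    """
    if num_users < 2:
        raise ValueError("a collaboration needs at least two users")
    receivers = num_users - 1
    if architecture == "peer-to-peer":
        # The camera client transmits the frame to every other user itself.
        return receivers, 0
    if architecture == "client-server":
        # The camera client transmits once; the server forwards the frame.
        return 1, receivers
    raise ValueError("unknown architecture: " + architecture)


for users in (2, 5, 10):
    print(users,
          video_sends_per_frame(users, "peer-to-peer"),
          video_sends_per_frame(users, "client-server"))
```

In both cases the frame crosses the network once per receiver in total, but in the
client-server model the load on the camera client stays constant and the growth is
shifted to the server, which is where the bottleneck mentioned above arises.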

3 Collaborative Design Application

3.1 Overview of the Augmented Reality System 

In this section we present how Augmented Reality techniques can be used in a
design application and give an overview of our collaborative Augmented Reality
system.




Fig. 1. Sketch of the principal components of our collaborative Augmented
Reality system; top: local components (client); bottom: network structure

Figure 1 shows a sketch of the principal components of the implemented
collaborative AR system. The input (video stream) is split and sent to the render
system and to the tracking module. The video frame is rendered together with depth
information of the real world. The calculated camera parameters are used to render
the virtual objects. Video frame and virtual objects are combined by z-keying, and
the combined image is then displayed. This approach delays the display of the
video frame but ensures that video frame and rendered virtual objects are
synchronized. When a collaboration is established, video frame grabbing and
tracking are done only by the client that controls the real camera. Video frames
and calculated camera parameters are transmitted to all other users in the
collaboration. The video frames are compressed prior to transmission. Virtual
objects and interactions are sent to all users. In the following sections we give
a more detailed description of the implemented system.

3.2 User Interface 

Figure 2 shows the user interface of our collaborative AR application. The
collaborative scene is displayed in the main viewer. Additionally we provide two
other viewers, which allow the user to browse databases and visualize objects
without or before including them in the collaborative scene. In a collaborative
session objects can be visualized locally (bottom viewer) or globally (top
viewer). A model shown in the global viewer can be discussed by all users and then
be placed into the video. We do not share the camera position in this viewer, so
that each user can inspect the object undisturbed. In a collaborative session
users should be aware of their partners. We provide visual information about the
partners in the collaboration by showing each partner as an icon on the interface
below the viewer. In Figure 2 there are two icons representing two users. Note
that there are three users in the collaboration, since we do not display a user's
own icon.

3.3 Calibration/Tracking 

As already laid out, standard electromagnetic trackers are fast but not very
accurate and are therefore not suitable for AR applications. Optical tracking is
much more time-consuming and requires very fast computers, but it is very
accurate. These algorithms work only in more or less tailored environments [12].
High accuracy is needed when dealing with occlusion problems, which often occur in
e.g. interior design applications. In our system we use an optical calibration and
tracking algorithm that has been developed by Koller et al. [12] and is described
in detail there. Here we give only a brief description. Figure 3 shows the setup
that is used for camera calibration and tracking. The 8 black squares are used for
calibration and tracking. The algorithm assumes that the exact 3D coordinates of
the black squares are known. In a first calibration step the camera is calibrated
fully automatically. The corners of the black squares are detected in the image
with sub-pixel accuracy and matched with the known 3D coordinates. Based on the
sets of 2D image and 3D world coordinates, the external and internal camera
parameters are calculated. The black squares are also used for tracking. In each
Fig. 2. User Interface of our AR application



frame the corners of the black squares are searched for locally. The starting point
for the local search is predicted by an extended Kalman filter. Based on the new
corner positions, camera position and orientation are calculated and the Kalman
filter is updated.
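
The predict-then-update cycle described above can be sketched with a much simpler
filter than the one used in the cited algorithm. The code below is a linear,
constant-velocity Kalman filter for a single image coordinate of one corner; it is
not the extended Kalman filter of the tracking algorithm, and the noise values are
assumed for illustration only. The prediction would serve as the starting point of
the local corner search, and the corner position found there is fed back as the
measurement.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state (position, velocity) for one image coordinate."""

    def __init__(self, position, velocity=0.0, variance=100.0,
                 process_noise=1.0, measurement_noise=0.25):
        self.pos = position
        self.vel = velocity
        # 2x2 state covariance, stored element by element.
        self.p00, self.p01, self.p10, self.p11 = variance, 0.0, 0.0, variance
        self.q = process_noise        # assumed value, added to both variances
        self.r = measurement_noise    # assumed corner detection variance

    def predict(self, dt=1.0):
        """Advance the state by one frame and return the predicted position."""
        p00, p01, p10, p11 = self.p00, self.p01, self.p10, self.p11
        self.pos += self.vel * dt
        # Covariance propagation F P F^t + Q with F = [[1, dt], [0, 1]].
        self.p00 = p00 + dt * (p01 + p10) + dt * dt * p11 + self.q
        self.p01 = p01 + dt * p11
        self.p10 = p10 + dt * p11
        self.p11 = p11 + self.q
        return self.pos

    def update(self, measured_position):
        """Correct the state with the corner position found by the search."""
        s = self.p00 + self.r                 # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        residual = measured_position - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        p00, p01 = self.p00, self.p01
        self.p00 = (1.0 - k0) * p00
        self.p01 = (1.0 - k0) * p01
        self.p10 -= k1 * p00
        self.p11 -= k1 * p01


# A corner whose x-coordinate moves by 2 pixels per frame.
kf_x = ConstantVelocityKalman1D(position=100.0)
for frame in range(20):
    search_start = kf_x.predict()           # where to start the local search
    found = 100.0 + 2.0 * frame             # stand-in for the detected corner
    kf_x.update(found)
print("predicted x for the next frame:", kf_x.predict())
```

After a few frames the prediction follows the motion closely, so the local search
window can be kept small, which is what makes the tracking fast.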




Fig. 3. Setup for calibration and tracking 



3.4 Adding of Virtual Objects to the Real World 

In Augmented Reality scenarios such as annotation, interactive placement or
manipulation of objects is not necessary, whereas in a design application users
must be able to interact with the virtual objects. The technique used depends on
the availability of geometrical information about the real world. In our
collaborative design application we assume that a 3D description of the real world
is available. We use it to offer a very intuitive and easy way to incorporate
virtual objects into the real world. Figure 4 shows an example. The interface
allows a user to select and mark corresponding points in the video and on the
visualized virtual object. The selected points are shown as red spheres in order
to give the user visual feedback on the selection. Based on these marked points
the position, orientation and, if desired, size of the virtual object are
calculated, and the object is then placed into the video frame. In the example a
user wants to place the fax machine shown in the lower right window onto the
table. He marks e.g. three points on the table and three points on the bottom of
the fax machine. A copy of the fax machine is then placed at the marked 3D
position.
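
The text does not state how position, orientation and size are computed from the
marked points. One possible way, given here only as a sketch under that caveat, is
to build an orthonormal frame from each point triple and to map one frame onto the
other. The first point fixes the position, the first edge fixes a direction and the
scale, and the third point fixes the plane. All function names are hypothetical.

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def unit(a):
    n = norm(a)
    if n == 0.0:
        raise ValueError("degenerate point selection (coincident or collinear)")
    return [x / n for x in a]


def frame(p0, p1, p2):
    """Right-handed orthonormal axes defined by three non-collinear points."""
    x = unit(sub(p1, p0))
    z = unit(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return [x, y, z]


def placement(model_pts, scene_pts, allow_scaling=True):
    """Return scale s, rotation R and translation t with scene = s * R * model + t."""
    fm, fs = frame(*model_pts), frame(*scene_pts)
    # R maps each model axis onto the corresponding scene axis.
    R = [[sum(fs[k][i] * fm[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    s = 1.0
    if allow_scaling:
        s = (norm(sub(scene_pts[1], scene_pts[0]))
             / norm(sub(model_pts[1], model_pts[0])))
    rotated = [sum(R[i][j] * model_pts[0][j] for j in range(3)) for i in range(3)]
    t = [scene_pts[0][i] - s * rotated[i] for i in range(3)]
    return s, R, t


def apply_placement(s, R, t, p):
    return [s * sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


# Three points on the bottom of the model and three points marked on the table.
model = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
scene = [(1, 2, 3), (1, 4, 3), (-1, 2, 3)]
s, R, t = placement(model, scene)
print(s, t, apply_placement(s, R, t, (1, 1, 0)))
```

With exactly three correspondences the third point is matched exactly only if the
two triangles are similar; a least-squares fit over more points would be a natural
refinement.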




Fig. 4. Adding a virtual object to the video frame is done interactively by
marking corresponding points in the scene; left (before including): red spheres on
the table and on the bottom of the virtual object (fax machine in the lower right
window); right (after including): the position, orientation and size of the virtual
fax machine have been calculated and the object is placed into the video frame



3.5 Handling of Real-Virtual Object Interactions

An important issue in Augmented Reality is the handling of interactions between
real and virtual objects. Basic interaction types are collision, occlusion and
shadows. In some application fields such as annotation or guided surgery these
kinds of interactions are less important. In design applications one cannot
neglect collision or occlusion handling. Proper collision and occlusion handling
significantly increases the impression that virtual objects are part of the real
world. Handling of collision and occlusion requires knowledge of the 3D geometry
of the real world.

Mutual Occlusion of Real and Virtual Objects. For occlusion handling two general
cases must be dealt with: real objects can occlude virtual ones and vice versa.
Virtual objects can be combined with video frames by applying luminance keying. In
this technique, virtual objects are rendered on a black background, and the black
areas are then replaced by the video signal. As a result, virtual objects always
occlude real ones. Breen et al. [3] demonstrated occlusion handling in Augmented
Reality by using a model-based and a depth-based method. In the depth-based
method, occlusion tests are performed on depth maps of the real scene. Virtual
objects are rendered in black where they are occluded, and again luminance keying
is used to achieve occlusion of virtual objects by real ones.

In our system z-keying is used to achieve occlusion effects. The known geometry
of the real world, given as a polygonal model, is registered to the video frame
and then rendered only into the z-buffer of the rendering system. This rendering
pass yields a registered depth map of the real world for each video frame. The
virtual objects are then rendered using the initialized z-buffer for occlusion
detection. Based on the result of the depth test, the color of each pixel of the
output frame is taken either from the video or from the virtual object. Figure 5
shows an example of occlusion in AR. The virtual statue and chair occlude the real
wall and are partly occluded by the real office container.
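
The per-pixel decision made by z-keying can be sketched in a few lines. In the
real system this test is carried out by the z-buffer of the rendering hardware;
the nested lists and the function name below are only an illustrative stand-in.

```python
def z_key(video, real_depth, virtual_color, virtual_depth):
    """Compose one output frame by a per-pixel depth test.

    video          colors of the video frame
    real_depth     depth map rendered from the registered model of the real world
    virtual_color  colors of the rendered virtual objects, None where there is none
    virtual_depth  depth of the virtual objects at each pixel
    """
    output = []
    for row in range(len(video)):
        out_row = []
        for col in range(len(video[row])):
            color = virtual_color[row][col]
            in_front = (color is not None
                        and virtual_depth[row][col] < real_depth[row][col])
            # Take the virtual color only where the virtual object is closer.
            out_row.append(color if in_front else video[row][col])
        output.append(out_row)
    return output


INF = float("inf")
video = [["wall", "container", "wall"]]
real_depth = [[5.0, 2.0, 5.0]]
virtual_color = [["chair", "chair", None]]
virtual_depth = [[3.0, 3.0, INF]]
print(z_key(video, real_depth, virtual_color, virtual_depth))
```

In this toy frame the chair hides the wall in the first pixel, is hidden by the
nearer container in the second, and is absent in the third.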




Fig. 5. left: original video frame; right: virtual objects (chair, statue) occlude
real objects (wall) and are partly occluded by a real object (office container)



Collision between Real and Virtual Objects. Performing collision detection is
more complicated than detecting occlusion. Breen et al. [3] used registered depth
maps for collision detection. A collision is detected when the stored z-value is
smaller than the z-value of a point of the bounding box of the virtual object.
However, this approach does not work in all cases. Consider a situation like the
one depicted in Figure 5. The virtual chair can be moved behind the real office
container, since the office container stands sufficiently far away from the wall.
The depth-map approach would not allow such a movement, since the z-values of the
chair are greater than the z-values of the container. Other approaches use
axis-aligned bounding box trees, sphere trees or oriented bounding box trees. In
our application, where we assume that objects undergo only rigid motion, Oriented
Bounding Box Trees (OBBTrees) are used for accurate collision detection at
interactive rates. In this approach an OBBTree is calculated for each object, and
collision detection is carried out by performing hierarchical intersection tests
between the OBBTrees of the objects to be tested. This approach requires
additional computing cost for calculating the OBBTree of each object. Fortunately
this has to be done only once. In our implementation the OBBTrees are calculated
during initialization or when an object is added to the world. Figure 6 shows a
sketch of the collision engine. When a user manipulates an object or a group of
objects, collision tests are performed between the manipulated objects (active
objects) and all other objects (inactive objects) in the scene. No collision test
is performed between active objects. This is possible because the objects in an
active group do not change their position relative to other group members. A
transformation that is applied to an active object is used to update the OBBTree,
and collision detection is performed at the new object position. When a collision
is detected, the transformation is rejected and the OBBTree is reset. When no
collision is detected, the transformation is accepted and the OBBTree is not
reset. The result of the collision detection is always returned to the AR
application to allow further treatment. In case of a collision we reject the last
transformation and give audible feedback to the user, so that the object does not
move any further when hitting another object. When no collision is detected, the
object is rendered at the new position. Moving objects in crowded worlds can be a
lengthy and annoying task. To ease such tasks we allow the user to turn collision
detection on and off interactively for selected objects. In the present
implementation our collision detection engine is built upon RAPID.
Fig. 6. Sketch of the collision engine
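
The accept/reject logic of the collision engine described above can be sketched as
follows. To keep the example short, axis-aligned boxes and pure translations stand
in for the OBBTrees and general rigid transformations of the real system, and the
interface of RAPID is not reproduced; all class and method names are hypothetical.

```python
class Box:
    """Axis-aligned box given by its minimum and maximum corner."""

    def __init__(self, lo, hi):
        self.lo, self.hi = list(lo), list(hi)

    def translated(self, delta):
        return Box([a + d for a, d in zip(self.lo, delta)],
                   [a + d for a, d in zip(self.hi, delta)])

    def overlaps(self, other):
        return all(self.lo[i] < other.hi[i] and other.lo[i] < self.hi[i]
                   for i in range(3))


class CollisionEngine:
    def __init__(self):
        self.boxes = {}
        self.enabled = {}

    def add(self, name, box):
        self.boxes[name] = box
        self.enabled[name] = True

    def set_collision(self, name, enabled):
        """Let the user switch collision detection on or off for one object."""
        self.enabled[name] = enabled

    def try_move(self, active, delta):
        """Move the active group by delta; return False if it was rejected."""
        moved = {name: self.boxes[name].translated(delta) for name in active}
        for name, box in moved.items():
            if not self.enabled[name]:
                continue
            for other, other_box in self.boxes.items():
                # Active objects are never tested against each other.
                if other in moved or not self.enabled[other]:
                    continue
                if box.overlaps(other_box):
                    return False      # reject: stored boxes stay unchanged
        self.boxes.update(moved)      # accept: keep the new positions
        return True


engine = CollisionEngine()
engine.add("chair", Box((0, 0, 0), (1, 1, 1)))
engine.add("container", Box((2, 0, 0), (3, 1, 1)))
print(engine.try_move(["chair"], (0.5, 0, 0)))   # free space
print(engine.try_move(["chair"], (1.0, 0, 0)))   # would hit the container
```

Because the candidate boxes are computed separately from the stored ones,
rejecting a transformation simply means discarding them, which plays the role of
resetting the OBBTree.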

3.6 Distributing Augmented Reality

Network Architecture. In our application we use a client-server architecture
with a totally replicated database, since we assume only a limited number of users
in the collaboration, all using computers with sufficiently high rendering
performance. Figure 1 shows a sketch of the network architecture of our AR
application and the principal data flow between the clients and the server. The
video stream and the camera parameters are transmitted only from one client to all
other clients, whereas 3D models and 3D events are transmitted from all clients to
all clients.



Distributing 3D-Models and 3D-Events. In a totally replicated database all
models must be available locally at all client sites and therefore need to be
distributed. Distributing large models over low-bandwidth networks can take a
considerable amount of time. Models should therefore be stored in an appropriate
format such as VRML or OpenInventor. Both formats can store 3D information very
efficiently. VRML, as an emerging standard for 3D description on the Internet, is
preferable, but only an ASCII data format is specified for it. Our application is
based on OpenInventor, which offers the possibility to distribute objects in the
binary OpenInventor format. Before distribution, objects are written to a buffer
in memory. After distribution the model data are read again from a buffer and
included in the local database. Distribution of 3D events is less critical, since
only changes in transformation matrices need to be distributed.



Distributing Video Stream and Camera Parameters. A significant problem in
collaborative AR applications is the distribution of the huge amount of data of
the video stream. Several techniques can be used to reduce the amount of data:
size, resolution or number of channels of the frames can be reduced, or
compression techniques such as JPEG or MPEG can be used. In an AR application
reduction of frame size or resolution cannot be used, since the video frame is a
significant part of the scene. MPEG offers a high compression rate, but
compression takes too long. The compression rate of JPEG is lower than that of
MPEG, but compression takes less time. Currently we use JPEG for frame
compression and decompression. Using hardware compression/decompression instead
of the currently used software solution will reduce the compression time
significantly. Camera calibration and tracking are done only by the user who has
the physical camera. The camera parameters are distributed together with each
frame, so that synchronization between camera parameters and video frames is
ensured.
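
Sending the camera parameters in the same message as the compressed frame is what
keeps the two synchronized. A minimal sketch of such a message is given below. The
layout (a frame number, twelve camera parameters read as a 3 x 3 rotation plus a
position, and the length of the JPEG data) is an assumption made for illustration
and is not the wire format of the described system.

```python
import struct

# Little-endian header: frame number, 12 doubles, payload length in bytes.
_HEADER = struct.Struct("<I12dI")


def pack_frame(frame_number, camera_params, jpeg_bytes):
    """Bundle one compressed video frame with the camera parameters it belongs to."""
    if len(camera_params) != 12:
        raise ValueError("expected 12 camera parameters")
    header = _HEADER.pack(frame_number, *camera_params, len(jpeg_bytes))
    return header + jpeg_bytes


def unpack_frame(message):
    """Recover frame number, camera parameters and JPEG data from one message."""
    fields = _HEADER.unpack_from(message)
    frame_number, camera_params, length = fields[0], list(fields[1:13]), fields[13]
    payload = message[_HEADER.size:_HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return frame_number, camera_params, payload


identity_pose = [1.0, 0.0, 0.0,
                 0.0, 1.0, 0.0,
                 0.0, 0.0, 1.0,
                 0.0, 0.0, 0.0]
message = pack_frame(42, identity_pose, b"\xff\xd8 stand-in for JPEG data \xff\xd9")
print(len(message), unpack_frame(message)[0])
```

A receiver that renders the virtual objects with the parameters taken from the
same message can never pair a frame with the pose of a different frame.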



Maintaining Scene Consistency. Since we use a totally replicated database, we
need to take care of maintaining scene consistency and synchronizing user
interactions. Global information about the database (scene) is stored at a
central position, the server. When a user interacts with the scene, two general
types of interactions are possible: global and local interactions. Local
interactions and changes to the scene are not controlled by the server. Global
interactions need to be authorized by the server before they are accepted. Figure
7 shows an example of local and global interactions. User 1 marks points for
including an object. The points are visualized as red spheres. This interaction
is a local one and does not need to be authorized or transmitted to other users.
User 2 performs a global interaction by selecting an object (chair). The
interaction needs to be authorized in order to maintain consistency of the scene.
In the depicted example a locking request is sent to the server when user 2
selects the chair. The server compares the object with a list of locked objects.
Since the chair is not locked by another user, the locking request is confirmed
and a locking signal is sent to all other users. User 2 is allowed to manipulate
the object, whereas the object is locked for all other users. Locked objects are
enclosed by an opaque bounding box to visualize the locking. When a locking
request is rejected, all manipulations done by the user who requested the lock are
reset.
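
The server-side bookkeeping behind this locking scheme can be sketched as follows.
The class and method names are hypothetical, and network messages such as the
locking signal sent to the other users are left out; the sketch only shows how a
request is compared with the list of locked objects.

```python
class LockServer:
    """Keeps the list of locked objects and decides on locking requests."""

    def __init__(self):
        self._owner = {}    # object name -> user holding the lock

    def request_lock(self, user, obj):
        """Return True if the request is confirmed, False if it is rejected."""
        owner = self._owner.get(obj)
        if owner is not None and owner != user:
            return False            # locked by another user: reject
        self._owner[obj] = user     # confirm and record the lock
        return True

    def release(self, user, obj):
        """Release a lock; only the user holding it may do so."""
        if self._owner.get(obj) == user:
            del self._owner[obj]
            return True
        return False

    def locked_for(self, user):
        """Objects this user must display as locked (held by someone else)."""
        return sorted(o for o, u in self._owner.items() if u != user)


server = LockServer()
print(server.request_lock("user 2", "chair"))   # confirmed
print(server.request_lock("user 1", "chair"))   # rejected, user 1 resets changes
print(server.locked_for("user 1"))              # shown with an opaque box
```

Local interactions, such as marking points, never reach this class, which matches
the distinction between local and global interactions made above.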




Fig. 7. left: local view of user 1; right: local view of user 2; user 1 performs 
local interaction (marking points for including an object); user 2 performs global 
interaction by selecting an object (chair) 



4 Conclusion and Future Work 

We have developed and tested an Augmented Reality system for interactive
collaborative design tasks. Object interactions such as collision and occlusion
are taken into account and handled by the system. Fast and accurate collision
detection is performed by using OBBTrees. Occlusion handling is based on
geometric models of the real world. Video streams are compressed prior to
transmission to decrease network load. Scene consistency is maintained by
applying locking mechanisms. The system has been successfully tested in trials
between Rostock and Darmstadt. Beyond these trials we will test our collaborative
design application in a transatlantic test environment between our institute in
Darmstadt and our branch located in Providence (Rhode Island, USA). Currently our
system can only be used in pre-modeled static scenes. The tracking algorithm
relies on the black squares as tracking features. Future work will concentrate on
the development of new tracking algorithms in order to be able to use our system
in more general environments.

Abstract. The aim of this paper is to give the non-specialist reader
a comprehensible and intuitive introduction to multiview relations. It
focusses on the geometric interpretation of the different image relationships,
but also presents a concise mathematical formalism which allows one
to derive the algebraic expressions explicitly in an elementary and uniform
manner. Special attention has been paid both to these multiview
constraints as geometric incidence relations between image features (i.e.
points and lines) in different views and to their use for image transfer.
Moreover, an attempt has been made to provide sufficient pointers to
the literature where the interested reader may find additional information
on particular subjects as well as alternative viewpoints and mathematical
formalisms.



1 Introduction 

During the last few years important progress has been made in the analysis and
reconstruction of 3-dimensional (3D) scenes from multiple images and image
sequences obtained by uncalibrated cameras. Many of these developments were
made possible by the discovery of, and new insights into, the interrelationships
between corresponding geometric features observed in different images of the 3D
scene. These multiview relations are derived and studied in an elaborate series
of research papers by a multitude of methods and mathematical formalisms,
each having its own advantages. For the interested but non-specialist reader,
however, it might not always be easy to select the reference that is best suited
to his application. The aim of this paper therefore is to provide an intuitive and
non-technical introduction to the key ideas and concepts of multiview relations.
The focus of attention is on the geometrical interpretation of the interimage
relationships and on how they relate to the 3-dimensional structure of the scene.
As a stepping stone towards the different formal expressions encountered in the
literature, special care is taken to translate the geometrical relationships into
algebraic formulas in a clear and concise manner, which should not only help
to memorize and recall the formulas, but also provide the reader with a
working knowledge to derive the most favourable formulation for the problem
at hand. Therefore, the mathematics is kept as simple as possible. Essentially,
a working knowledge of linear algebra and analytical geometry suffices. Some
familiarity with basic notions of projective geometry might be helpful, but is
not a prerequisite for understanding the text.

It is clear that within the extent of this paper it is impossible to give a
complete overview of all the results presented in the extensive literature on this
subject, nor of the different approaches that have been used to derive them.
Where possible, an attempt has been made to provide pointers to the literature
where the interested reader may find additional information on particular
topics or alternative viewpoints and mathematical formalisms. It should be
realised, however, that these references too are not intended to provide a
complete overview of the field.

The paper is organised as follows: Since our presentation is mainly geometric
in nature, a good understanding of the image formation process is essential.
Therefore, the paper opens in section 2 with an extensive treatment of the
perspective camera model. Next, the binocular relations which hold between any
two views of a static scene are discussed and the well-known epipolar constraint
is derived in section 3. The trinocular relations between three views of the scene
are elaborated in section 4. Apart from the underlying geometrical concept
(sections 4.1 and 4.2), the interrelationship between the trifocal constraints
for points and those for lines is explained in section 4.3. Special attention
also goes to their use for point and line transfer between the images in section
4.4. The close connection between the epipolar and trifocal constraints is
explored in section 4.5. The same "decoupling principle" then leads to the
quadrinocular relations between four views and their interpretation in section 5.
An interesting transfer principle between four views is derived as well. The
general theory for n views is presented in section 6. It is first proven in
section 6.1 that the relations between 5 or more views boil down to the epipolar,
trifocal and quadrifocal constraints for the different image pairs, triples and
quadruples that can be formed with the given views. Section 6.2 then completes the
story by showing that the epipolar, trifocal and quadrifocal constraints presented
in the previous sections cover all existing relations between two, three and four
images.

Finally, it is emphasized that the results and insights formulated in this paper 
are influenced by many authors. Mentioning all publications to which this paper 
is indebted would amount to citing all the references listed at the end. However, 
the following articles have greatly influenced the presentation (in alphabetical 

2 The Perspective Camera Model

In this paper, the image formation process in a camera is modeled as a perspective
projection of the scene onto an image plane. In mathematical terms, the scene
is defined as a collection of points, lines and surfaces in Euclidean 3-space.
The mathematical relation between the coordinates of a scene point and its
projection in the image plane is most easily described in a camera-centered
reference frame. This is a right-handed, orthonormal reference frame for the scene
which is defined as in Figure 1 (left): The origin coincides with the center of
projection (i.e. the center of the lens of the camera), the Z-axis is the optical
axis of the camera, and the XY-plane is the plane through the center of projection
and perpendicular to the optical axis. The image plane is the plane with equation
Z = 1. The camera-centered reference frame of the scene induces an orthonormal
reference frame in the image, as depicted in the figure: The origin is the point
of intersection of the image plane with the optical axis of the camera (i.e. the
Z-axis), and the coordinate axes in the image are parallel to the X- and Y-axis
of the camera-centered reference frame of the scene.

The image of a scene point P is the point of intersection p of the line through
P and the origin of the camera-centered reference frame with the image plane
with equation Z = 1. If P has coordinates $(X, Y, Z) \in \mathbb{R}^3$ with respect
to the camera-centered reference frame, then the $(u, v)$-coordinates of its image
p are $u = X/Z$ and $v = Y/Z$. It is important to note that the triple
$(u, v, 1) \in \mathbb{R}^3$ can be interpreted both as the scene coordinates of
the image point p with respect to the camera-centered reference frame, and as the
direction vector of the ray of sight of the camera which passes through the scene
point P.







Fig. 1. Left: In a camera-centered reference frame, the image of a scene point
$(X, Y, Z)$ is $(u, v) = (X/Z, Y/Z)$. Right: The position and orientation of the
camera in the scene are given by a position vector C and a 3 x 3 rotation matrix
R. The image $(u, v)$ of a scene point $(X, Y, Z)$ is then given by formula (1).



When more than one camera is used, or when the objects in the scene are
represented with respect to another, non-camera-centered reference frame (called
the world frame), then the position and orientation of the camera in the scene
are described by a point C, indicating the origin, and a 3 x 3 rotation matrix R
indicating the orientation of the camera-centered reference frame with respect
to the world frame. More precisely, the column vectors of the rotation matrix
R are the unit direction vectors of the coordinate axes of the camera-centered
reference frame, as depicted in Figure 1 (right). The coordinates of a scene point
P with respect to the camera-centered reference frame are then found by taking
the dot products of the relative position vector $P - C$ with these unit vectors;
or equivalently, by pre-multiplying the column vector $P - C$ with the transpose
of the orientation matrix R: $R^t (P - C)$. Hence, if P has coordinates
$(X, Y, Z)$ and C has coordinates $(C_1, C_2, C_3)$ with respect to the world
frame, then the projection of P into the image plane has $(u, v)$-coordinates:

$$
\begin{aligned}
u &= \frac{r_{11}(X - C_1) + r_{21}(Y - C_2) + r_{31}(Z - C_3)}{r_{13}(X - C_1) + r_{23}(Y - C_2) + r_{33}(Z - C_3)} \\
v &= \frac{r_{12}(X - C_1) + r_{22}(Y - C_2) + r_{32}(Z - C_3)}{r_{13}(X - C_1) + r_{23}(Y - C_2) + r_{33}(Z - C_3)}
\end{aligned}
\qquad (1)
$$

where $r_{ij}$ is the $(i, j)$th entry of the rotation matrix R.
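
As an illustration of formula (1), the following sketch computes the geometric
image coordinates $(u, v)$ of a scene point from the camera position C and the
rotation matrix R. The function name and the example camera are chosen only for
this illustration.

```python
def project(P, C, R):
    """Image coordinates (u, v) of scene point P according to formula (1).

    P and C are given in world coordinates; R is the 3x3 rotation matrix whose
    columns are the axes of the camera-centered reference frame.
    """
    d = [P[i] - C[i] for i in range(3)]
    # Camera-centered coordinates R^t (P - C): entry k is the dot product of
    # (P - C) with the k-th column of R.
    cam = [sum(R[i][k] * d[i] for i in range(3)) for k in range(3)]
    if cam[2] == 0:
        raise ValueError("point lies in the plane Z = 0 of the camera frame")
    return cam[0] / cam[2], cam[1] / cam[2]


# Camera at C = (2, 0, 0) looking along the world X-axis: its X-, Y- and Z-axes
# are (0, 0, -1), (0, 1, 0) and (1, 0, 0), stored as the columns of R.
R = [[0, 0, 1],
     [0, 1, 0],
     [-1, 0, 0]]
print(project((6, 2, -2), (2, 0, 0), R))
```

For this camera the point has camera-centered coordinates (2, 2, 4), so the
numerators and the common denominator of formula (1) are 2, 2 and 4.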

When working with digital images it is more natural to indicate the position
of an image point in pixel coordinates. The transition from the geometrical
$(u, v)$-coordinates to the pixel coordinates, which will be denoted as $(x, y)$,
is modeled by an (affine) transformation of the form:

$$
\begin{cases}
x = k_x u + s v + x_0 \\
y = k_y v + y_0
\end{cases}
\qquad (2)
$$

Here, $(x_0, y_0)$ are the pixel coordinates of the origin of the $uv$-reference
frame, which is called the optical center of the image; $k_x$ and $k_y$ indicate
the number of pixels per unit length in the horizontal and vertical direction
respectively, and thus implicitly describe the length and width of a pixel. Their
ratio $k_x / k_y$ is called the aspect ratio of the camera. Furthermore, $s$
measures how strongly the shape of the pixels deviates from being rectangular.
This parameter is usually referred to as the skewness of the pixels; $s = 0$
corresponds to rectangular pixels. Clearly, the numbers $k_x$, $k_y$ and $s$ depend
on the focal length and the zooming distance of the lens. Together, $k_x$, $k_y$,
$s$, $x_0$ and $y_0$ are referred to as the intrinsic camera parameters, whereas the
scene point C and the rotation matrix R, representing the position and orientation
of the camera, are called the extrinsic camera parameters.

More elegant formulas are obtained if one uses extended coordinates for the
image points. In particular, if a point $p$ in the image plane with pixel coordinates
$(x, y)$ is represented by the column vector $p = (x, y, 1)^t$, and if its geometric
coordinates $(u, v)$ are represented by the column vector $(u, v, 1)^t$, then formula (2)
becomes

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} k_x & s & x_0 \\ 0 & k_y & y_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} . \tag{3}$$

The 3 × 3-matrix

$$K = \begin{pmatrix} k_x & s & x_0 \\ 0 & k_y & y_0 \\ 0 & 0 & 1 \end{pmatrix} \tag{4}$$

is called the calibration matrix of the camera. If $(u, v, 1)^t$ is interpreted as the
direction vector of the ray of sight of the camera, which passes through the scene
point $P$, then the calibration matrix $K$ can be viewed as the transformation that
relates an image point with its corresponding ray of sight in the camera-centered
reference frame of the camera. Furthermore, formula (1) can be rewritten as

$$\rho \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} r_{11} & r_{21} & r_{31} \\ r_{12} & r_{22} & r_{32} \\ r_{13} & r_{23} & r_{33} \end{pmatrix} \begin{pmatrix} X - C_1 \\ Y - C_2 \\ Z - C_3 \end{pmatrix} , \tag{5}$$

where $\rho = r_{13}(X - C_1) + r_{23}(Y - C_2) + r_{33}(Z - C_3)$ is a non-zero real number.
Together, formulas (3) and (5) combine into the projection equations:

$$\rho\, p = K R^t (P - C) \tag{6}$$

for some non-zero $\rho \in \mathbb{R}$. Many authors prefer to use extended coordinates for
scene points as well. So, if $\tilde{P} = (X, Y, Z, 1)^t$ are the extended coordinates of the
scene point $P = (X, Y, Z)^t$, then the projection equations (6) become

$$\rho\, p = \left( K R^t \mid -K R^t C \right) \tilde{P} . \tag{7}$$

The 3 × 4-matrix $M = ( K R^t \mid -K R^t C )$ is called the projection matrix of the
camera. Readers who are familiar with projective geometry will observe that
equation (7) defines a projective mapping from projective 3-space to projective
2-space. To this end, the extended coordinates $p$ and $\tilde{P}$ must be interpreted as
(an instance of) the homogeneous coordinates of the corresponding image and
scene point.
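As an illustration of equations (6) and (7), the following minimal sketch builds
a projection matrix and projects one scene point. The helper names
(`projection_matrix`, `project`) and all camera values are hypothetical and
chosen only for this example; matrices are plain lists of rows so that no
external library is assumed.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def projection_matrix(K, R, C):
    """Return the 3x4 matrix M = (K R^t | -K R^t C) of equation (7)."""
    KRt = mat_mul(K, transpose(R))
    last_column = [-x for x in mat_vec(KRt, C)]
    return [row + [t] for row, t in zip(KRt, last_column)]


def project(M, P):
    """Project the scene point P = (X, Y, Z) and return its pixel coordinates."""
    rho_p = mat_vec(M, list(P) + [1.0])  # equals rho * (x, y, 1)^t
    rho = rho_p[2]                       # scale factor, assumed non-zero
    return rho_p[0] / rho, rho_p[1] / rho


# Hypothetical camera: every number below is an illustrative assumption.
K = [[2.0, 0.0, 1.0],   # k_x, s,   x_0
     [0.0, 2.0, 1.0],   # 0,   k_y, y_0
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],   # camera axes aligned with the world axes
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
C = [1.0, 0.0, 2.0]     # camera position in the world frame

M = projection_matrix(K, R, C)
x, y = project(M, (1.0, 2.0, 4.0))
```

Dividing by the third coordinate in `project` is exactly the removal of the
scale factor $\rho$; a scene point for which this coordinate vanishes has no
finite image point.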

3 Relation Between Two Views: The Epipolar Constraint 

When studying the relationships between different views of a static scene, it can 
be useful to ask oneself the following two questions: 

(1) From which scene points can the geometric features (i.e. points and lines) 
that one observes in the image(s) be the projections? 

(2) Where can these scene structures be observed in the other image(s)? 

Fig. 2. The point $p'$ in the second image corresponding to a point $p$ in the first
image lies on the line $l'$ which is the projection in the second image of the ray
of sight creating the point $p$ in the first camera.

For example, a point $p$ in one image is the projection of a scene point $P$ that
can be at any position along the ray of sight of the camera creating the image
point $p$. Therefore, the point $p'$ corresponding to $p$ in the second image (i.e.
the projection $p'$ of $P$ in the second image) of the same scene must lie on the
projection $l'$ of this ray of sight in the second view, as depicted in Figure 2. To
turn this into a mathematical formula, suppose for a moment that the camera
parameters of both cameras are known. Then, according to formula (3), the
direction of the ray of sight creating the image point $p$ in the first camera is
given by the 3-vector $K^{-1} p$ in the camera-centered reference frame of the first
camera, where $K$ is the calibration matrix of the first camera. With respect to
the world frame, this direction vector is $R K^{-1} p$, where $R$ is the rotation matrix
expressing the orientation of the first camera in the world frame. As the position
of the first camera in the world frame is given by $C$, the parameter equations of
the ray of sight in the world frame are:

$$P = C + \rho\, R K^{-1} p \quad \text{for some } \rho \in \mathbb{R}. \tag{8}$$

Note that this equation is just a rewriting of the projection equations (6). So,
every scene point $P$ satisfying equation (8) for some real number $\rho$ projects
onto $p$ in the first image. If $K'$, $R'$ and $C'$ are the camera parameters of the
second camera, then, according to formula (6) again, the projection $p'$ of $P$ in
the second image is given by

$$\rho' p' = K' R'^t (P - C') \tag{9}$$

for some non-zero $\rho' \in \mathbb{R}$. Substituting expression (8) for $P$ into this formula
yields

$$\rho' p' = \rho\, K' R'^t R K^{-1} p + K' R'^t (C - C') . \tag{10}$$

The last term in this equation corresponds to the projection $e'$ of the position
$C$ of the first camera in the second image:

$$\rho'_e\, e' = K' R'^t (C - C') . \tag{11}$$

$e'$ is called the epipole of the first camera in the second image. The first term
in the righthand side of the equality, on the other hand, indicates the direction
of the ray of sight (8) in the second image. Indeed, recall that $R K^{-1} p$ is the
direction vector of the ray of sight in the world frame. In the camera-centered
reference frame of the second camera, this vector is $R'^t R K^{-1} p$. The corresponding
position in the second image is then expressed (in homogeneous coordinates) by
$K' R'^t R K^{-1} p$. Put differently, $K' R'^t R K^{-1} p$ are the homogeneous coordinates
of the vanishing point of the ray of sight (8) in the second image.

To simplify the notation, put $A = K' R'^t R K^{-1}$. Then $A$ is an invertible
3 × 3-matrix which, for every point $p$ in the first view, gives the homogeneous
coordinates $Ap$ of the vanishing point in the second view of the ray of sight
observing $p$ in the first camera. In terms of projective geometry, $A$ is a matrix
of the homography that maps the first image onto the second one via the plane
at infinity of the scene [?]. Formula (10) can now be rewritten as

$$\rho' p' = \rho\, A p + \rho'_e\, e' . \tag{12}$$

Observe that equation (12) algebraically expresses that, for a given point $p$ in
one view, the corresponding point $p'$ in the other view lies on the line $l'$ through
the epipole $e'$ and the vanishing point $Ap$ of the ray of sight of $p$ (see Figure 2).
This line $l'$ is called the epipolar line corresponding to $p$ in the second image. This
geometrical relationship is more efficiently expressed by formula (14) in the next
theorem. But first, we fix some notation: for a 3-vector $a = (a_1, a_2, a_3)^t \in \mathbb{R}^3$,
let $[a]_\times$ denote the skew-symmetric 3 × 3-matrix

$$[a]_\times = \begin{pmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{pmatrix} , \tag{13}$$

which represents the cross product with $a$; i.e. $[a]_\times v = a \times v$ for all $v \in \mathbb{R}^3$.
Observe that $[a]_\times$ has rank 2 if $a$ is non-zero.

Theorem 1. (Epipolar constraint) [5,11] For every two views $I$ and $I'$ of
a static scene, there exists a 3 × 3-matrix $F$ of rank 2, called the fundamental
matrix of the image pair $(I, I')$, with the following property: If $p \in I$ and $p' \in I'$
are corresponding points in the images, then

$$p'^t F p = 0 . \tag{14}$$

Moreover, the fundamental matrix $F$ is given by $F = [e']_\times A$ where $e'$ is the
epipole in $I'$ and $A$ is the 3 × 3-(homography-)matrix defined above.

Proof. According to equation (12), the point $p' \in I'$ corresponding to $p \in I$
lies on the epipolar line $l'$ in $I'$ which passes through $e'$ and $Ap$. Since $p'$,
$e'$ and $Ap$ actually are 3-vectors representing the homogeneous coordinates of
the corresponding image points, this geometrical relationship is algebraically
expressed as $| p' \; e' \; Ap | = 0$ where the vertical bars denote the determinant of
the 3 × 3-matrix whose columns are the specified column vectors. Recall from
linear algebra that this determinant equals

$$| p' \; e' \; Ap | = p'^t ( e' \times Ap ) ; \tag{15}$$

or, expressing the cross product as a matrix multiplication,

$$| p' \; e' \; Ap | = p'^t [e']_\times A p . \tag{16}$$

This proves the theorem. □

Remark 1. By definition of the fundamental matrix, $e'^t F = 0$. So, the epipole
$e'$ in $I'$ can be computed if $F$ is known. Moreover, the epipolar constraint (14)
brings, for each pair of corresponding image points $p$ and $p'$ in $I$ and $I'$
respectively, one homogeneous equation $p'^t F p = 0$ that is linear in the entries of
the fundamental matrix $F$. Hence, $F$ can be computed linearly, up to a non-zero
scalar factor, from (at least) 8 point correspondences between the two images.
Due to the presence of noise in the images, the matrix $F$ computed from these
point correspondences generally will not be of rank 2. Imposing the rank 2
constraint for the computation of $F$, however, results in a non-linear criterion. For
an overview and comparison of different estimation procedures for $F$, the
interested reader is referred to [?]. Robust methods for computing $F$ can be found
in [?].
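Theorem 1 and Remark 1 can be checked numerically with a small sketch such as
the one below. It builds $F = [e']_\times A$ from known camera parameters, evaluates
$p'^t F p$ for a few scene points, and recovers the epipole from $F$ alone as a
vector orthogonal to the columns of $F$. The function names, the two cameras
and the scene points are assumptions made for this illustration, and the
epipole is kept as an unnormalized homogeneous vector (the factor $\rho'_e$ is
absorbed in it), which only rescales $F$.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def skew(a):
    """Matrix [a]_x of formula (13), so that skew(a) v = a x v."""
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def inv_calibration(K):
    """Closed-form inverse of an upper-triangular calibration matrix (4)."""
    kx, s, x0 = K[0]
    ky, y0 = K[1][1], K[1][2]
    return [[1.0 / kx, -s / (kx * ky), (s * y0 - ky * x0) / (kx * ky)],
            [0.0, 1.0 / ky, -y0 / ky],
            [0.0, 0.0, 1.0]]

def fundamental_matrix(K1, R1, C1, K2, R2, C2):
    """Return F = [e']_x A and the (unnormalized) epipole e'."""
    K2R2t = mat_mul(K2, transpose(R2))
    A = mat_mul(mat_mul(K2R2t, R1), inv_calibration(K1))      # A = K' R'^t R K^-1
    e2 = mat_vec(K2R2t, [a - b for a, b in zip(C1, C2)])       # rho'_e e', formula (11)
    return mat_mul(skew(e2), A), e2

def project(K, R, C, P):
    """Homogeneous image coordinates K R^t (P - C), equation (6)."""
    return mat_vec(mat_mul(K, transpose(R)), [a - b for a, b in zip(P, C)])

def left_null_vector(F):
    """A vector e with e^t F = 0: cross product of two columns of F."""
    c = transpose(F)  # rows of c are the columns of F
    candidates = [cross(c[0], c[1]), cross(c[0], c[2]), cross(c[1], c[2])]
    return max(candidates, key=lambda v: dot(v, v))

# Hypothetical cameras and scene points (illustrative values only).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K1, R1, C1 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]], I3, [0.0, 0.0, 0.0]
K2 = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
R2 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about the z-axis
C2 = [1.0, 0.0, 2.0]

F, epipole = fundamental_matrix(K1, R1, C1, K2, R2, C2)
scene_points = [[1.0, 2.0, 4.0], [0.0, 1.0, 3.0], [-1.0, 0.5, 5.0]]
residuals = [dot(project(K2, R2, C2, P), mat_vec(F, project(K1, R1, C1, P)))
             for P in scene_points]          # p'^t F p for each correspondence
recovered = left_null_vector(F)              # proportional to the epipole e'
```

With noisy image data the residuals would no longer vanish exactly; the sketch
only illustrates the noise-free relations and is not an estimation procedure
for $F$.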

4 Relations Between Three Views: The Trifocal Constraints

Next, suppose that three images of the same scene are given. If the fundamental
matrices $F_{13}$ and $F_{23}$ between the first and the third, respectively the second
and the third, view are known, then the position of the point $p''$ in the third
image, which corresponds to the points $p$ in the first and $p'$ in the second image,
is easily found as the intersection of the epipolar lines $l''_1$ of $p$ and $l''_2$ of $p'$ in the
third image, as is depicted in Figure 3.

Fig. 3. The point $p''$ in the third image corresponding to the points $p$ in the
first and $p'$ in the second view is the intersection of the epipolar lines $l''_1$ and $l''_2$
corresponding to $p$ and $p'$ respectively.

Unfortunately, this construction breaks down when the epipolar lines $l''_1$ and $l''_2$
coincide. This happens for scene points belonging to the plane defined by the
three camera positions (i.e. the three centers of projection). Moreover, the
construction is also poorly conditioned for scene points that are close to this
plane. In [?] (see also [?]) trilinear relations between the homogeneous
coordinates of corresponding points in three views were derived, which were proven
to be algebraically independent of the epipolar constraints in [?]. But, before
writing down these formulas, let us examine the geometrical setting first.
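The epipolar transfer just described, and its breakdown for scene points in the
plane of the three camera centers, can be sketched as follows. To keep the
example short, all three cameras are assumed to be calibrated with $K = I$ (an
assumption of this sketch, not of the text), so that each fundamental matrix
reduces to $[e]_\times R_b^t R_a$. The function names and camera placements are
hypothetical.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def skew(a):
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def fundamental(Ra, Ca, Rb, Cb):
    """F for views a and b with K = I: maps a point of a to its epipolar line in b."""
    Rbt = transpose(Rb)
    e = mat_vec(Rbt, [x - y for x, y in zip(Ca, Cb)])  # epipole of camera a in view b
    return mat_mul(skew(e), mat_mul(Rbt, Ra))

def project(R, C, P):
    """Homogeneous image coordinates R^t (P - C) for a camera with K = I."""
    return mat_vec(transpose(R), [x - y for x, y in zip(P, C)])

def epipolar_transfer(F13, F23, p, p2):
    """Intersection of the epipolar lines of p and p' in the third image."""
    return cross(mat_vec(F13, p), mat_vec(F23, p2))

# Hypothetical camera configuration (illustrative values only).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R1, C1 = I3, [0.0, 0.0, 0.0]
R2, C2 = I3, [1.0, 0.0, 2.0]
R3, C3 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 1.0, 2.0]
F13 = fundamental(R1, C1, R3, C3)
F23 = fundamental(R2, C2, R3, C3)

# Generic scene point: the two epipolar lines meet in the projection p''.
P = [1.0, 2.0, 4.0]
p3_transferred = epipolar_transfer(F13, F23, project(R1, C1, P), project(R2, C2, P))
p3_direct = project(R3, C3, P)

# Scene point in the plane through C1, C2 and C3: both epipolar lines coincide,
# so their cross product degenerates to the zero vector.
P_in_plane = [1.0, 1.0, 4.0]
degenerate = epipolar_transfer(F13, F23, project(R1, C1, P_in_plane),
                               project(R2, C2, P_in_plane))
```

The vanishing cross product in the second case is the algebraic counterpart of
the failure described above; for points merely close to the trifocal plane the
result is non-zero but small, which is why the construction is poorly
conditioned there.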



4.1 The Fundamental Trifocal Constraint 

The trifocal constraints essentially describe the geometric incidence relations
between image points and lines in three views. The fundamental relationship is
depicted in Figure 4.

Fig. 4. Two arbitrary lines $l'$ and $l''$ through corresponding points $p'$ and $p''$ in
respectively the second and the third image define a line $L$ in the scene whose
projection $l$ in the first image contains $p$.

Suppose $p$, $p'$ and $p''$ are corresponding points in three views $I$, $I'$ and $I''$ of
the same scene. Consider an arbitrary line $l''$ in the third image. The rays of
sight in the world creating the line $l''$ in $I''$ sweep out a world plane $\pi''$
containing the center of projection $C''$ of the third camera. Moreover, if
the line $l''$ passes through $p''$, then this projecting plane $\pi''$ must also contain the
scene point $P$ of which $p$, $p'$ and $p''$ are the projections. Similarly, every line $l'$
through the point $p'$ in the second view defines a projecting plane $\pi'$ in the world
which also contains $P$. Consequently, the projecting planes $\pi'$ and $\pi''$ intersect
in a line $L$ through the point $P$ in the scene. The projection $l$ of $L$ in the first
image therefore must contain the image point $p$ in $I$. Theorem 2 expresses this
incidence relation algebraically. But first, we fix the following notation: An image
line $l$ with equation $ax + by + c = 0$ is represented in the sequel by the column
vector $(a, b, c)^t \in \mathbb{R}^3$. As the triple $(\lambda a, \lambda b, \lambda c)^t \in \mathbb{R}^3$ represents the same line
for every non-zero $\lambda \in \mathbb{R}$, these column vectors have to be interpreted as the
homogeneous coordinates of the given line. We will denote this by $l \sim (a, b, c)^t$
where the symbol $\sim$ means “equality up to a scalar factor”.

Theorem 2. (Fundamental trifocal constraint) [13] For every three views
$I$, $I'$ and $I''$ of a static scene, there exists a 3 × 3 × 3-tensor $T = ( T_k^{ij} )_{1 \le i,j,k \le 3}$,
called the trifocal tensor of the image triple $(I, I', I'')$, with the following property:
If $p \in I$, $p' \in I'$ and $p'' \in I''$ are corresponding points in the images, then for
every line $l'$ through $p'$ in $I'$ and for every line $l''$ through $p''$ in $I''$,

$$l'^t\, [Tp]\, l'' = 0 , \tag{17}$$

where $[Tp]$ is the 3 × 3-matrix whose $(i,j)$th entry is

$$[Tp]_{ij} = T_1^{ij}\, x + T_2^{ij}\, y + T_3^{ij} \tag{18}$$

and with $p = (x, y, 1)^t$. Moreover, if, with the notations as before,
$A = \frac{1}{\rho'_e} K' R'^t R K^{-1}$ and $B = \frac{1}{\rho''_e} K'' R''^t R K^{-1}$, then

$$[Tp] = (Ap)\, e''^t - e'\, (Bp)^t , \tag{19}$$

where $e'$ and $e''$ are the epipoles of the first camera in the second and the third
view respectively. Furthermore, the entries of the trifocal tensor $T$ are given
by

$$T_k^{ij} = a_{ik}\, (e'')_j - b_{jk}\, (e')_i \quad \text{for } 1 \le i, j, k \le 3, \tag{20}$$

where $(e')_i$ and $(e'')_j$ are the $i$th and $j$th entry of the epipoles $e'$ and $e''$, and
with $a_{ij}$ and $b_{ij}$ being the $(i,j)$th entry of the matrices $A$ and $B$ respectively.

The theorem is proven by rephrasing the previous geometrical construction in 
algebraic terms. But first, we need to know how to find the projecting plane of 
a line in the image. 

Lemma 1. [?] The projecting plane $\pi$ of a line $l$ in the image obtained by a
camera with projection matrix $M = ( K R^t \mid -K R^t C )$ has equation

$$l^t K R^t (P - C) = 0 \tag{21}$$

in the world frame. Homogeneous coordinates for $\pi$ are given by the 4-vector
$\pi \sim M^t l \in \mathbb{R}^4$.

Proof. A world point $P$ belongs to the projecting plane $\pi$ generating the line
$l$ in the image if and only if the projection $p$ of $P$ in the image lies on the
line $l$; i.e. $l^t p = 0$. According to the projection equations (6), $p$ is given by
$\rho\, p = K R^t (P - C)$ for some non-zero scalar $\rho \in \mathbb{R}$. Substituting this expression
for $p$ in the equality $l^t p = 0$ yields formula (21) of the lemma. Using extended
coordinates $\tilde{P} = (X, Y, Z, 1)^t$ for the world points $P = (X, Y, Z)^t$, equation (21)
becomes $l^t ( K R^t \mid -K R^t C )\, \tilde{P} = 0$. The middle matrix in the lefthand side
of this equality is the projection matrix $M$ of the camera. So, the equation can
simply be rewritten as $l^t M \tilde{P} = 0$, implying that the entries of the 4-vector
$M^t l \in \mathbb{R}^4$ are homogeneous coordinates for $\pi$, as claimed by the lemma. □

Proof of Theorem 2. Let $P$ be the scene point of which $p$, $p'$ and $p''$ are the
projections in the different images. According to Lemma 1, $P$ belongs to the
projecting plane $\pi'$ generating the line $l'$ in the second camera if and only if

$$l'^t K' R'^t (P - C') = 0 . \tag{22}$$

Similarly, the scene point $P$ belongs to the projecting plane $\pi''$ generating the
line $l''$ in the third camera if and only if

$$l''^t K'' R''^t (P - C'') = 0 . \tag{23}$$

As $P$ projects onto $p$ in the first image, it must also lie on the ray of sight
observing $p$ in the first camera; i.e.

$$P = C + \rho\, R K^{-1} p \quad \text{for some } \rho \in \mathbb{R}, \tag{24}$$

by formula (8). Substituting this expression for $P$ in equations (22) and (23)
gives

$$\begin{cases} l'^t K' R'^t (C - C') + \rho\; l'^t K' R'^t R K^{-1} p = 0 \\ l''^t K'' R''^t (C - C'') + \rho\; l''^t K'' R''^t R K^{-1} p = 0 \end{cases} \tag{25}$$

Remember from formula (11) that $K' R'^t (C - C') = \rho'_e\, e'$ gives the epipole $e'$ of
the first camera in the second view; and that $K'' R''^t (C - C'') = \rho''_e\, e''$ gives the
epipole $e''$ of this camera in the third image. Take $A$ and $B$ as in the theorem.
Then, after division by the non-zero scalars $\rho'_e$ and $\rho''_e$ respectively, system (25)
becomes

$$\begin{cases} l'^t e' + \rho\; l'^t A p = 0 \\ l''^t e'' + \rho\; l''^t B p = 0 \end{cases} \tag{26}$$

Eliminating the unknown scalar $\rho \in \mathbb{R}$ in the previous two equations gives

$$( l'^t A p )\, ( l''^t e'' ) - ( l'^t e' )\, ( l''^t B p ) = 0 . \tag{27}$$

As $l''^t e'' = e''^t l''$ and $l''^t (Bp) = (Bp)^t l''$, equation (27) can be rewritten as

$$l'^t \left[ (Ap)\, e''^t - e'\, (Bp)^t \right] l'' = 0 . \tag{28}$$

Observe that the expression between square brackets is just the 3 × 3-matrix
$[Tp]$ given in formula (19). The $(i,j)$th entry of this matrix is

$$[Tp]_{ij} = (Ap)_i\, (e'')_j - (e')_i\, (Bp)_j . \tag{29}$$

Using $p = (x, y, 1)^t$, it follows from formula (18) that

$$T_k^{ij} = a_{ik}\, (e'')_j - b_{jk}\, (e')_i \quad \text{for } 1 \le i, j, k \le 3, \tag{30}$$

as claimed by the theorem. □

Remark 2. First of all, observe that the trifocal constraint (17) expresses a
trilinear algebraic relation between (the homogeneous coordinates of) 1 image point
and 2 image lines. In tensor terminology, this is expressed by stating that the
trifocal tensor has one covariant and two contravariant indices, viz. $k$ and $ij$
respectively. Furthermore, it should also be emphasized that the lines $l'$ and $l''$
in the theorem do not need to be corresponding lines in the images. Secondly,
the attentive reader will have noticed the scalar factor $\frac{1}{\rho'_e}$ in the definition of
the matrix $A$ (and a similar factor in that of $B$), which was not present in the
definition of $A$ in section 3. Due to these factors, the scalars $\rho'_e$ and $\rho''_e$ could be
factored out in the equations of system (25), thus resulting in nicer looking
formulas. Since the epipolar constraint (14) in Theorem 1 is linear and homogeneous
in the entries of the fundamental matrix $F$, defining $A$ as $A = \frac{1}{\rho'_e} K' R'^t R K^{-1}$
in section 3 as well will not affect the epipolar constraint (14) at all. Moreover,
it is observed in Remark 1 that the fundamental matrix $F$ can only be
retrieved from the image data up to a non-zero scalar multiple. So, replacing $A$
by $A = \frac{1}{\rho'_e} K' R'^t R K^{-1}$ will not even be noticed in practice.
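Lemma 1 and Theorem 2 lend themselves to a small numerical check. The sketch
below assumes, purely for brevity, that the first camera is the reference camera
($K = I$, $R = I$, $C = 0$), so that the other two projection matrices are simply
$(A \mid e')$ and $(B \mid e'')$ with unnormalized epipoles; by Remark 2 this only
rescales the tensor. All function names and camera values are hypothetical.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def camera(K, R, C):
    """Return (A, e) with projection matrix M = (A | e) = (K R^t | -K R^t C)."""
    A = mat_mul(K, transpose(R))
    return A, [-x for x in mat_vec(A, C)]

def project(A, e, P):
    """Extended pixel coordinates (x, y, 1) of the scene point P."""
    h = [a + b for a, b in zip(mat_vec(A, P), e)]
    return [h[0] / h[2], h[1] / h[2], 1.0]

def trifocal_tensor(A, e1, B, e2):
    """T[k][i][j] = a_ik e''_j - b_jk e'_i, formula (20) with 0-based indices."""
    return [[[A[i][k] * e2[j] - B[j][k] * e1[i] for j in range(3)]
             for i in range(3)] for k in range(3)]

def tensor_point(T, p):
    """Matrix [Tp] of formula (18) for p = (x, y, 1)."""
    return [[sum(T[k][i][j] * p[k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def lines_through(p):
    """Horizontal line, vertical line and one further line through p = (x, y, 1)."""
    h = [0.0, -1.0, p[1]]
    v = [1.0, 0.0, -p[0]]
    return [h, v, [a + 2.0 * b for a, b in zip(h, v)]]

# Hypothetical second and third camera (first camera: K = I, R = I, C = 0).
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A, e1 = camera([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]], I3, [1.0, 0.0, 2.0])
B, e2 = camera(I3, [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 1.0, 2.0])
T = trifocal_tensor(A, e1, B, e2)

P = [1.0, 2.0, 4.0]
p = [P[0] / P[2], P[1] / P[2], 1.0]      # projection in the reference camera
p2, p3 = project(A, e1, P), project(B, e2, P)
Tp = tensor_point(T, p)

# Lemma 1: the plane M'^t l' of every line l' through p' contains P.
M2 = [row + [t] for row, t in zip(A, e1)]
plane_residuals = [dot(mat_vec(transpose(M2), l), P + [1.0]) for l in lines_through(p2)]

# Theorem 2: l'^t [Tp] l'' = 0 for all lines through p' and p''.
trifocal_residuals = [dot(l2, mat_vec(Tp, l3))
                      for l2 in lines_through(p2) for l3 in lines_through(p3)]

# A line in the third image that misses p'' (here x = 0) violates the constraint.
off_line = dot(lines_through(p2)[1], mat_vec(Tp, [1.0, 0.0, 0.0]))
```

Note that the nine line pairs tested are not corresponding lines; as stressed in
Remark 2, the constraint only requires that they pass through $p'$ and $p''$.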

4.2 The Trifocal Constraint for Corresponding Lines 

The trifocal constraint (17) implies relations between corresponding points and
between corresponding lines in three views, as we will show next. Recall that
equation (17) actually expresses algebraically that the line $l$ in the first image,
which corresponds to the lines $l'$ and $l''$ in respectively the second and the third
image, passes through the image point $p = (x, y, 1)^t$. Hence, the equation of $l$
must follow directly from equation (17).

Proposition 1. (Trifocal constraint for lines) [?] With the same notations
as in Theorem 2: If $l'$ and $l''$ are corresponding lines in the images $I'$ and $I''$
respectively, then the homogeneous coordinates of the corresponding line $l$ in the
first view $I$ are (up to a non-zero scalar factor) given by

$$l \sim \left( l'^t T_1 l'' ,\; l'^t T_2 l'' ,\; l'^t T_3 l'' \right)^t , \tag{31}$$

where $T_k$ is the 3 × 3-matrix whose $(i,j)$th entry is $T_k^{ij}$.

Proof. It follows directly from formula (18) that

$$[Tp] = T_1\, x + T_2\, y + T_3 . \tag{32}$$

Hence, equation (17) can be written as

$$l'^t \left( T_1\, x + T_2\, y + T_3 \right) l'' = 0 ; \tag{33}$$

or equivalently,

$$( l'^t T_1 l'' )\, x + ( l'^t T_2 l'' )\, y + ( l'^t T_3 l'' ) = 0 . \tag{34}$$

As $p$ can be any point on the line $l$, the latter formula actually is the equation
of the line $l \sim ( l'^t T_1 l'' , l'^t T_2 l'' , l'^t T_3 l'' )^t$ in the first image. □

Remark 3. Whereas the epipolar constraint (14) is the “simplest” constraint (in
the sense that it involves the smallest number of views) that must hold between
corresponding points in different images, equation (31) is the “simplest”
constraint that must hold between corresponding lines in different images. Indeed,
if only two views of the scene are given, then any pair of lines in the two views
might be corresponding lines, because they can always be interpreted as being
the images of the intersection line $L$ of their projecting planes, as observed
before. To verify whether this is indeed the case, a third view (and Proposition 1)
is needed. In fact, formula (31) actually predicts where the corresponding
line is to be found in the image. In other words, equation (31) allows one to transfer
lines from two images to a third one. In this respect, it is worth noting that the
geometric construction underlying Theorem 2 and Proposition 1 degenerates if
the projecting planes $\pi'$ and $\pi''$ of the image lines $l'$ and $l''$ coincide. This
happens precisely when $l'$ and $l''$ are corresponding epipolar lines in the views $I'$ and
$I''$ respectively. Theorem 2, however, remains valid even in this case. Finally, it
is also interesting to note that the matrices $T_k$ of the proposition can be written
in matrix form as

$$T_k = a_k\, e''^t - e'\, b_k^t \quad \text{for } 1 \le k \le 3, \tag{35}$$

with $e'$ and $e''$ the epipoles of the first camera in the second and the third views
respectively, and where $a_k$ and $b_k$ denote the $k$th column of the matrices $A$
and $B$ of Theorem 2. Clearly, the range of $T_k$ is contained in the linear
subspace of $\mathbb{R}^3$ spanned by $a_k$ and $e'$. Hence, except for certain special camera
configurations [?], $T_k$ is of rank 2.
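Formula (31) and the rank statement of Remark 3 can be illustrated with the
following sketch. It assumes a reference first camera ($K = I$, $R = I$, $C = 0$)
and takes the matrices $A$, $B$ and the unnormalized epipoles $e'$, $e''$ of two
hypothetical cameras as given numbers; the names `line_transfer`, `hom` and
`det3` are likewise illustrative.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def trifocal_tensor(A, e1, B, e2):
    """T[k] is the matrix T_k = a_k e''^t - e' b_k^t of formula (35)."""
    return [[[A[i][k] * e2[j] - B[j][k] * e1[i] for j in range(3)]
             for i in range(3)] for k in range(3)]

def line_transfer(T, l2, l3):
    """Line in the first view predicted by formula (31)."""
    return [dot(l2, mat_vec(T[k], l3)) for k in range(3)]

def hom(A, e, P):
    """Homogeneous image coordinates A P + e of the scene point P."""
    return [a + b for a, b in zip(mat_vec(A, P), e)]

# Hypothetical cameras relative to a reference first camera (K = I, R = I, C = 0):
# projection matrices (A | e1) and (B | e2), with unnormalized epipoles e1, e2.
A, e1 = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]], [-4.0, -2.0, -2.0]
B, e2 = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [-1.0, 0.0, -2.0]
T = trifocal_tensor(A, e1, B, e2)

# A scene line through two points P and Q, and its three images.
P, Q = [1.0, 2.0, 4.0], [0.0, 1.0, 3.0]
l_true = cross(P, Q)                               # first view: p ~ P, q ~ Q
l2 = cross(hom(A, e1, P), hom(A, e1, Q))           # line l' in the second view
l3 = cross(hom(B, e2, P), hom(B, e2, Q))           # line l'' in the third view

l_predicted = line_transfer(T, l2, l3)             # proportional to l_true
determinants = [det3(T[k]) for k in range(3)]      # all zero: rank(T_k) <= 2
```

Representing an image line as the cross product of two of its points is a
standard device of homogeneous coordinates; the predicted line agrees with
`l_true` only up to scale, in line with the symbol $\sim$ in (31).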

4.3 The Trifocal Constraints for Corresponding Points 

On the other hand, the trifocal constraint (17) in Theorem 2 holds for every
choice of lines $l'$ and $l''$ through the corresponding points $p'$ and $p''$ in images $I'$
and $I''$ respectively. All the lines $l''$ through a given point $p''$ form a pencil of lines
(i.e. a 1-parameter family of lines) in image $I''$, whose top is the point $p''$. Recall
from analytic geometry that the homogeneous coordinates of any line $l''$ in the
pencil can be expressed as a linear combination of the homogeneous coordinates
of two arbitrary, but fixed lines $l''_1$ and $l''_2$ in that pencil. When working with
rectangular images, a natural choice for $l''_1$ and $l''_2$ are the horizontal line $h''$ and the
vertical line $v''$ passing through the image point $p''$, as depicted in Figure 5.
Hence, every line $l''$ through the point $p''$ in $I''$ can be written as

$$l'' = \alpha\, h'' + \beta\, v'' \quad \text{for some } \alpha, \beta \in \mathbb{R}. \tag{36}$$

The trifocal constraint (17) can thus be re-expressed as

$$\alpha\; l'^t [Tp]\, h'' + \beta\; l'^t [Tp]\, v'' = 0 \quad \text{for all } \alpha, \beta \in \mathbb{R}; \tag{37}$$

or equivalently, by the system of equations

$$\begin{cases} l'^t [Tp]\, h'' = 0 \\ l'^t [Tp]\, v'' = 0 \end{cases} \tag{38}$$

because $\alpha$ and $\beta$ are arbitrary real numbers.

Fig. 5. Every line $l''$ through the image point $p''$ can be expressed as a linear
combination of (the homogeneous coordinates of) the horizontal line $h''$ and the
vertical line $v''$ through $p''$.

If the image coordinates of $p''$ are $(x'', y'')$, i.e. $p'' = (x'', y'', 1)^t$, then the
horizontal line $h''$ through $p''$ has equation $y = y''$ and the vertical line $v''$ through
$p''$ has equation $x = x''$. The homogeneous coordinates of $h''$ and $v''$ thus
respectively are

$$h'' \sim (0, -1, y'')^t \quad \text{and} \quad v'' \sim (1, 0, -x'')^t . \tag{39}$$

System (38) yields two algebraic equations involving the coordinates of
corresponding image points $p$ and $p''$ in respectively the first and the third view and
an arbitrary line $l'$ passing through $p'$ in the second view. This relationship will
be discussed further in section 4.4 (Proposition 3) below.

Similarly, the line $l'$ passing through the point $p'$ in the second image can
be written as a linear combination $l' = \alpha' h' + \beta' v'$ of the horizontal line
$h' \sim (0, -1, y')^t$ and the vertical line $v' \sim (1, 0, -x')^t$ through $p' = (x', y', 1)^t$ in
$I'$. Substituting this expression for $l'$ in system (38) yields

$$\begin{cases} \alpha'\; h'^t [Tp]\, h'' + \beta'\; v'^t [Tp]\, h'' = 0 \\ \alpha'\; h'^t [Tp]\, v'' + \beta'\; v'^t [Tp]\, v'' = 0 \end{cases} \tag{40}$$

Since these equations must hold for all $\alpha', \beta' \in \mathbb{R}$, it follows that

$$\begin{cases} h'^t [Tp]\, h'' = 0 \\ v'^t [Tp]\, h'' = 0 \\ h'^t [Tp]\, v'' = 0 \\ v'^t [Tp]\, v'' = 0 \end{cases} \tag{41}$$

These four linearly independent equations, which together are equivalent with
the trifocal constraint (17) of Theorem 2, express the algebraic relations that
must hold between the coordinates of corresponding image points in three views.
Substituting the homogeneous coordinates of $h'$, $v'$, $h''$ and $v''$, and expanding
the expressions, yields the well-known formulas calculated in [?] (see also [?]).
An easier way to remember and use them is in the following form.

Proposition 2. (Trifocal constraints for points) If $I$, $I'$ and $I''$ are
three views of a static scene, then each triple $p$, $p'$ and $p''$ of corresponding
points in $I$, $I'$ and $I''$ respectively satisfies the matrix equation

$$[p']_\times\, [Tp]\, [p'']_\times = 0_3 , \tag{42}$$

where $[Tp]$ is the 3 × 3-matrix defined in Theorem 2 (formula (18)). Moreover,
of the 9 relations collected in equation (42) only the 4 constraints that constitute
the upper 2 × 2-submatrix are linearly independent relations.

Proof. As $h'$, $v'$, $h''$ and $v''$ are column vectors and $[Tp]$ is a 3 × 3-matrix,
system (41) can be written in matrix form as

$$( h' \;\; v' )^t\, [Tp]\, ( h'' \;\; v'' ) = 0_2 . \tag{43}$$

Substituting the homogeneous coordinates of $h'$, $v'$, $h''$ and $v''$ gives

$$\begin{pmatrix} 0 & -1 & y' \\ 1 & 0 & -x' \end{pmatrix} [Tp] \begin{pmatrix} 0 & 1 \\ -1 & 0 \\ y'' & -x'' \end{pmatrix} = 0_2 . \tag{44}$$

Note that the leftmost matrix is formed by the first two rows of the
skew-symmetric 3 × 3-matrix $[p']_\times$ that represents the cross product with the 3-vector
$p' = (x', y', 1)^t$, as defined in formula (13). Also recall that the rank of the matrix
$[p']_\times$ is 2. As the first two rows of $[p']_\times$ clearly are linearly independent, the third
row of $[p']_\times$ must be a linear combination of the first two. And indeed,

$$\text{row 3} = -x'\, ( \text{row 1} ) - y'\, ( \text{row 2} ) . \tag{45}$$

Because the trifocal constraint (17) is linear in the 3-vector $l'$, adding this third
row to the leftmost matrix of equation (44) will just add another 2 valid (but
linearly dependent) relations to system (41). In fact, we just add the 2 equations
in system (40) with $\alpha' = -x'$ and $\beta' = -y'$ to the matrix equation (44) (or
equivalently, to system (41)). Hence, matrix equation (44) becomes



$$\begin{pmatrix} 0 & -1 & y' \\ 1 & 0 & -x' \\ -y' & x' & 0 \end{pmatrix} [Tp] \begin{pmatrix} 0 & 1 \\ -1 & 0 \\ y'' & -x'' \end{pmatrix} = 0_{3 \times 2} . \tag{46}$$

Similarly, the columns of the rightmost matrix in the lefthand side of
equation (46) are, up to sign, the first two columns of the 3 × 3-matrix $[p'']_\times$ that
represents the cross product with the 3-vector $p'' = (x'', y'', 1)^t$. And again, the
third column of $[p'']_\times$ can be written as a linear combination of the first two:

$$\text{column 3} = -x''\, ( \text{column 1} ) - y''\, ( \text{column 2} ) . \tag{47}$$

By the linearity of the trifocal constraint (17) in the 3-vector $l''$, adding this
column to the rightmost matrix in the lefthand side of equation (46) will just
add another 3 valid (but linearly dependent) relations to system (41). This brings
us to

$$\begin{pmatrix} 0 & -1 & y' \\ 1 & 0 & -x' \\ -y' & x' & 0 \end{pmatrix} [Tp] \begin{pmatrix} 0 & 1 & -y'' \\ -1 & 0 & x'' \\ y'' & -x'' & 0 \end{pmatrix} = 0_3 , \tag{48}$$

which, up to sign, is formula (42) of the proposition. □
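The matrix equation (42) is easy to test numerically. The sketch below forms
$[Tp]$ directly from formula (19) and multiplies it with the two cross-product
matrices; it assumes a reference first camera ($K = I$, $R = I$, $C = 0$), and
the matrices $A$, $B$, the unnormalized epipoles and the scene point are
hypothetical values chosen for this illustration.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def skew(a):
    """Matrix [a]_x of formula (13)."""
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def tensor_point(A, e1, B, e2, p):
    """[Tp] = (Ap) e''^t - e' (Bp)^t, formula (19)."""
    Ap, Bp = mat_vec(A, p), mat_vec(B, p)
    return [[Ap[i] * e2[j] - e1[i] * Bp[j] for j in range(3)] for i in range(3)]

def project(A, e, P):
    """Extended pixel coordinates (x, y, 1) for the projection matrix (A | e)."""
    h = [a + b for a, b in zip(mat_vec(A, P), e)]
    return [h[0] / h[2], h[1] / h[2], 1.0]

def point_constraint(p2, Tp, p3):
    """The 3x3 matrix [p']_x [Tp] [p'']_x of equation (42)."""
    return mat_mul(mat_mul(skew(p2), Tp), skew(p3))

# Hypothetical cameras relative to a reference first camera (K = I, R = I, C = 0).
A, e1 = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]], [-4.0, -2.0, -2.0]
B, e2 = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [-1.0, 0.0, -2.0]

P = [1.0, 2.0, 4.0]
p = [P[0] / P[2], P[1] / P[2], 1.0]
p2, p3 = project(A, e1, P), project(B, e2, P)
Tp = tensor_point(A, e1, B, e2, p)

matched = point_constraint(p2, Tp, p3)                    # all nine entries vanish
mismatched = point_constraint(p2, Tp, [0.6, -0.5, 1.0])   # wrong p'': not all zero
```

Only the upper 2 × 2 block of `matched` carries independent information; the
remaining entries are the linearly dependent relations added in (46) and (48).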

4.4 The Trifocal Constraints as Incidence Relations and as Transfer Principle

Before going on, it is worth summarizing the different relations between three
views which have been derived so far. Table 1 gives an overview of the different
trifocal relations ordered by the number and type of corresponding and incident
geometric image features involved. The “constraint” number refers to the
formula number where this constraint is expressed, and the “number of equations”
refers to the number of linearly independent equations that exist for the
homogeneous coordinates of the image features involved in that particular type of
constraint.

image features          constraint   no. of equations
three points            (42)         4
two points, one line    (38)         2
one point, two lines    (17)         1
three lines             (31)         2

Table 1. Overview of the different types of trifocal constraints.

Except for constraint (31), which predicts the position of the line
in the first view that corresponds to two given lines in the second and the third
view respectively, all the (other) constraints express geometric incidence
relations between the image features in the three views. In this form, the relations
are well-suited for verifying whether particular image features (specified by
their image coordinates) in the different views might be corresponding features
of the image triplet. The trifocal constraint (31) for lines can also be expressed
in this form by eliminating the common scalar factor.

Corollary 1. (Trifocal constraint for lines revisited) Let $I$, $I'$ and $I''$ be
three views of a static scene. Three lines $l$, $l'$ and $l''$ are corresponding lines in
$I$, $I'$ and $I''$ respectively (i.e. $l$, $l'$ and $l''$ are the projections of one and the
same 3D line in $I$, $I'$ and $I''$) if and only if

$$\begin{cases} \ell_1\, ( l'^t T_3 l'' ) - \ell_3\, ( l'^t T_1 l'' ) = 0 \\ \ell_2\, ( l'^t T_3 l'' ) - \ell_3\, ( l'^t T_2 l'' ) = 0 \end{cases} \tag{49}$$

where $T_k$ are the 3 × 3-matrices defined in Proposition 1 and with $l = (\ell_1, \ell_2, \ell_3)^t$.

□

Remark 4. When corresponding image features are identified in the three views,
then the relevant incidence relations summarized above all bring homogeneous
linear equations in the entries of the trifocal tensor $T$. Hence, $T$ can be
computed linearly, up to a non-zero scalar factor, from these equations provided
sufficient corresponding points and lines can be identified in the three views. More
precisely, the trifocal constraint (42) for corresponding image points yields four
linearly independent equations in the entries of $T$ for each triple of corresponding
points in $I$, $I'$ and $I''$; and the trifocal constraint (49) for corresponding image
lines brings two linearly independent equations in the entries of $T$ for each triple
of corresponding lines in $I$, $I'$ and $I''$. The tensor $T$ has $3^3 = 27$ entries, which have
to be determined up to a non-zero scalar multiple only, because all the relations
in Table 1 are linear in the entries of $T$. Therefore, $T$ can be computed linearly, up to a
non-zero scalar factor, from $n_p$ point and $n_l$ line correspondences between the
three views if $4 n_p + 2 n_l \ge 26$ [?]. Consequently, $T$ can be determined linearly,
up to a non-zero scalar factor, from a minimum of 7 point correspondences or
13 line correspondences alone. But, as has been observed in section 4.2, each
of the matrices $T_k$ defined in Proposition 1 (see also formula (35)) has rank 2.
Moreover, it has been proven in [?] that the rank of the trifocal tensor $T$ is
4. In fact, the entries of $T$ satisfy 8 non-linear algebraic relations such that
$T$ actually only has 18 degrees of freedom [?]. Due to the presence of noise
in the images, the tensor $T$ computed linearly from point and line
correspondences between the images generally will not satisfy these non-linear relations.
Imposing these relations in the computation of $T$, however, results in non-linear
criteria [?]. A robust method for computing $T$ can be found in [?] with
improvements in [?].
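The counting argument of Remark 4 amounts to simple arithmetic, which the
following sketch spells out. The function name is hypothetical; the numbers 4,
2 and 26 are those stated in the remark (27 tensor entries, determined up to one
common scale factor).

```python
def linear_solution_possible(n_points, n_lines):
    """True if 4 n_p + 2 n_l >= 26, i.e. enough independent linear equations
    are available to fix the 27 entries of T up to a common scalar factor."""
    unknowns = 3 ** 3 - 1                      # 27 entries, minus the free scale
    equations = 4 * n_points + 2 * n_lines     # per Table 1: 4 per point triple, 2 per line triple
    return equations >= unknowns


enough_points = linear_solution_possible(7, 0)     # 28 equations
too_few_points = linear_solution_possible(6, 0)    # 24 equations
enough_lines = linear_solution_possible(0, 13)     # 26 equations
mixed = linear_solution_possible(6, 1)             # 24 + 2 = 26 equations
```

The count only states when a linear solution is possible in principle; it does
not account for degenerate configurations, nor for the non-linear relations
between the entries of $T$ mentioned in the remark.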



Apart from having incidence relations which can be used to verify whether
specific image points and lines in the different views are corresponding features,
formulas such as the trifocal constraint (31) for lines, which predict where the
corresponding feature in one view must be when its positions in the other
two views are known, are also very useful. Therefore, we will now show how the
trifocal constraints (42) for points can be used to transfer points from two views
to a third one.

Corollary 2. (Trifocal constraint for points revisited) Let $I$, $I'$ and $I''$ be
three views of a static scene. If $p$ and $p'$ are corresponding points in the images
$I$ and $I'$ respectively, then the (extended) coordinates of the corresponding point
$p''$ in the third view $I''$ are (up to a non-zero scalar factor) given by

$$\rho'' p'' = [Tp]_{1*}^t - x'\, [Tp]_{3*}^t \quad \text{and} \quad \tau'' p'' = [Tp]_{2*}^t - y'\, [Tp]_{3*}^t , \tag{50}$$

where $[Tp]_{k*}$ denotes the $k$th row of the 3 × 3-matrix $[Tp]$ defined in Theorem 2
and with $\rho'', \tau'' \in \mathbb{R}$ being non-zero scalar factors. In case of noise-free data, both
equations are equivalent.

Proof. Recall from Proposition 2 that only the upper 2 × 2-submatrix in
formula (42) yields linearly independent equations for $p''$. This submatrix is given
by formula (44):



$$\begin{pmatrix} 0 & -1 & y' \\ 1 & 0 & -x' \end{pmatrix} [Tp] \begin{pmatrix} 0 & 1 \\ -1 & 0 \\ y'' & -x'' \end{pmatrix} = 0_2 . \tag{51}$$

Selecting the first row in the left matrix and the first column in the right matrix
of the lefthand side of the equation yields the relation

$$\begin{pmatrix} 0 & -1 & y' \end{pmatrix} [Tp] \begin{pmatrix} 0 \\ -1 \\ y'' \end{pmatrix} = 0 ; \tag{52}$$

or equivalently, when solving for $y''$,

$$\left\{ [Tp]_{23} - y'\, [Tp]_{33} \right\} y'' = [Tp]_{22} - y'\, [Tp]_{32} , \tag{53}$$

where $[Tp]_{ij}$ is the $(i,j)$th entry of the 3 × 3-matrix $[Tp]$. Similarly, the first row
in the left matrix and the second column in the right matrix of the lefthand side
of equation (51) yield the relation

$$\begin{pmatrix} 0 & -1 & y' \end{pmatrix} [Tp] \begin{pmatrix} 1 \\ 0 \\ -x'' \end{pmatrix} = 0 ; \tag{54}$$

or equivalently, when solving for $x''$,

$$\left\{ [Tp]_{23} - y'\, [Tp]_{33} \right\} x'' = [Tp]_{21} - y'\, [Tp]_{31} . \tag{55}$$

Equations (53) and (55) can be combined into the following matrix equation:

$$\left\{ [Tp]_{23} - y'\, [Tp]_{33} \right\} \begin{pmatrix} x'' \\ y'' \\ 1 \end{pmatrix} = \begin{pmatrix} [Tp]_{21} - y'\, [Tp]_{31} \\ [Tp]_{22} - y'\, [Tp]_{32} \\ [Tp]_{23} - y'\, [Tp]_{33} \end{pmatrix} . \tag{56}$$

Putting $\tau'' = [Tp]_{23} - y'\, [Tp]_{33}$ and using $p'' = (x'', y'', 1)^t$ transforms this
equation exactly into the right equation of formula (50) in the corollary. The
left equation in formula (50) follows in exactly the same manner if one repeats
this reasoning with the second row of the leftmost matrix in equation (51). □
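Both transfer formulas in (50) can be evaluated directly once $[Tp]$ is known.
The sketch below does so for one scene point and compares the two estimates of
$p''$. As before, it assumes a reference first camera ($K = I$, $R = I$, $C = 0$);
the matrices $A$, $B$, the unnormalized epipoles and the function names are
hypothetical.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def tensor_point(A, e1, B, e2, p):
    """[Tp] = (Ap) e''^t - e' (Bp)^t, formula (19)."""
    Ap, Bp = mat_vec(A, p), mat_vec(B, p)
    return [[Ap[i] * e2[j] - e1[i] * Bp[j] for j in range(3)] for i in range(3)]

def project(A, e, P):
    """Extended pixel coordinates (x, y, 1) for the projection matrix (A | e)."""
    h = [a + b for a, b in zip(mat_vec(A, P), e)]
    return [h[0] / h[2], h[1] / h[2], 1.0]

def transfer_point(Tp, p2):
    """Both right-hand sides of formula (50); each is proportional to p''."""
    x2, y2 = p2[0], p2[1]
    first = [Tp[0][j] - x2 * Tp[2][j] for j in range(3)]    # rho'' p''
    second = [Tp[1][j] - y2 * Tp[2][j] for j in range(3)]   # tau'' p''
    return first, second

def normalize(v):
    """Divide out the scalar factor (assumed non-zero) to obtain (x'', y'')."""
    return [v[0] / v[2], v[1] / v[2]]

# Hypothetical cameras relative to a reference first camera (K = I, R = I, C = 0).
A, e1 = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]], [-4.0, -2.0, -2.0]
B, e2 = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [-1.0, 0.0, -2.0]

P = [1.0, 2.0, 4.0]
p = [P[0] / P[2], P[1] / P[2], 1.0]
p2 = project(A, e1, P)
Tp = tensor_point(A, e1, B, e2, p)

first, second = transfer_point(Tp, p2)
estimate_1, estimate_2 = normalize(first), normalize(second)
p3_direct = project(B, e2, P)    # projection of P computed from the camera itself
```

With noise-free data both estimates coincide; with measured image points they
generally differ slightly, and the scalar factors $\rho''$ and $\tau''$ (the third
entries of `first` and `second`) indicate how well each estimate is conditioned.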



Not only the trifocal constraints for points and for lines can be used to
transfer corresponding image features from two views to a third one. The 2 points /
1 line relation (38) mentioned in Table 1 also provides an interesting transfer
principle which goes as follows: Suppose a point $p$ in the first view $I$ and a line
$l'$ in the second view $I'$ are given. Moreover, suppose that the line $l'$ actually is
the projection in $I'$ of a line $L$ in the scene that passes through the scene point
$P$ of which $p$ is the projection in $I$; but that, for some reason or another, you
are not able to identify the projection $p'$ of $P$ in $I'$. If the camera parameters of
both cameras were known, then the ray of sight creating the image point $p$
in the first camera could be computed from formula (8); and the projecting plane
generating the line $l'$ in the second camera would be given by formula (21). The
intersection of that ray of sight with this projecting plane gives the position of the
underlying point $P$ in the scene, as depicted in Figure 6.

Fig. 6. A line $l'$ in the second image, containing the (possibly unknown) point
$p'$ corresponding to a point $p$ in the first image, suffices to compute the position
of the point $p''$, corresponding to $p$ and $p'$, in the third view.

Generally, the camera parameters are not known, but the projection $p''$ of $P$ in a
third view can be determined from the 2 points / 1 line relation (38) in Table 1,
as is shown in the following proposition.

Proposition 3. (Point-line transfer) Let $I$, $I'$ and $I''$ be three views of
a static scene. If $p$ is a point in image $I$ and $l'$ is a line in image $I'$ which
contains the (possibly unknown) point $p'$ of $I'$ corresponding to $p$, then the
(extended) coordinates of the point $p''$ in the third view $I''$, corresponding to $p$
(and $p'$), are (up to a non-zero scalar factor) given by

$$\tau''\, p'' = [Tp]^t\, l' , \tag{57}$$

where $[Tp]$ is the $3 \times 3$-matrix defined in Theorem 2 and with $\tau'' \in \mathbb{R}$ being a
non-zero scalar factor.

Proof. Recall that the 2 points / 1 line relation consists of the following system of
equations:

$$\begin{cases} l'^t\,[Tp]\,h'' = 0 \\ l'^t\,[Tp]\,v'' = 0 \end{cases} \tag{58}$$

where $h''$ and $v''$ are column vectors representing homogeneous coordinates
of the horizontal and the vertical line through $p''$ in $I''$ respectively.
Substituting the homogeneous coordinates $h'' \sim (0, -1, y'')^t$ and
$v'' \sim (1, 0, -x'')^t$ for $h''$ and $v''$ in the system (58) yields two equations that
can be solved for $x''$ and $y''$ explicitly, as we did in the proof of Corollary 1. But
the system can also be solved in a more geometric manner. Indeed, system (58) can
be interpreted as stating that the 3-vector

$$\left( l'^t\,[Tp] \right)^t = [Tp]^t\, l' \tag{59}$$

contains homogeneous coordinates of an image point that lies on both lines $h''$
and $v''$. Since, by construction, $p''$ is the point of intersection of $h''$ and $v''$, the
proposition follows. □

It is an easy exercise to verify that the geometrical construction of the point
$p''$ explained just before the statement of Proposition 3 really coincides with the
point $p''$ defined by formula (57) in the proposition. Indeed, given the image
point $p$ in the first image, the parameter equations of the ray of sight observing
$p$ in the first camera are

$$P = C + \rho\, R K^{-1} p \quad \text{with } \rho \in \mathbb{R} ; \tag{60}$$

and, by Lemma 1, the equation of the projecting plane $\pi'$ generating the line $l'$
in the second camera is

$$l'^t K' R'^t ( P - C' ) = 0 . \tag{61}$$

The point $P$ of intersection of that ray of sight with the projecting plane $\pi'$ is
found by substituting expression (60) into formula (61):

$$l'^t \left[ K' R'^t ( C - C' ) + \rho\, K' R'^t R K^{-1} p \right] = 0 . \tag{62}$$

Recall that $K' R'^t ( C - C' ) = \rho'_e\, e'$ gives the epipole $e'$ of the first camera
in the second image, and from Theorem 2 that the matrix $A$ in the definition of
the trifocal tensor $T$ is defined by $A = \frac{1}{\rho'_e} K' R'^t R K^{-1}$. Hence,
equation (62) is equivalent to

$$l'^t e' + \rho\, l'^t A p = 0 . \tag{63}$$

This latter formula fixes the value of the parameter $\rho$ of the scene point $P$ in
equation (60).

The projection $p''$ of $P$ in the third view, on the other hand, is given by the
projection equations:

$$\rho''\, p'' = K'' R''^t ( P - C'' ) . \tag{64}$$

Substituting expression (60) for $P$ in this equation gives

$$\rho''\, p'' = K'' R''^t ( C - C'' ) + \rho\, K'' R''^t R K^{-1} p . \tag{65}$$

Using, as before, that $K'' R''^t ( C - C'' ) = \rho''_e\, e''$ gives the epipole $e''$ of the
first camera in the third image, and that the matrix $B$ in the definition of the
trifocal tensor $T$ is defined by $B = \frac{1}{\rho''_e} K'' R''^t R K^{-1}$, one obtains

$$\rho''\, p'' = \rho''_e\, e'' + \rho\, \rho''_e\, B p . \tag{66}$$

Eliminating the non-zero scalar $\rho$ from equations (63) and (66), and dividing by
the non-zero scalar $\rho''_e$, one finds that

$$\tau''\, p'' = e''\, ( l'^t A p ) - ( l'^t e' )\, B p , \tag{67}$$

where $\tau'' = \frac{\rho''}{\rho''_e}\, ( l'^t A p )$. As $l'^t A p = ( A p )^t\, l'$ and
$l'^t e' = e'^t\, l'$, equation (67) can be rewritten as

$$\tau''\, p'' = \left[\, e''\, ( A p )^t - ( B p )\, e'^t \,\right] l' . \tag{68}$$

The expression between square brackets is the transpose of the $3 \times 3$-matrix
$[Tp]$ of Theorem 2. This again proves Proposition 3.
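
The agreement between the geometric construction (equations (60) to (66)) and the algebraic formula (57) can also be checked on numbers. The sketch below is illustrative only and rests on assumptions that are not part of the text: the first camera is canonical ($K = R = I$, $C = 0$), so that the ray of sight is simply $P = \rho\,p$, and the second and third cameras are hypothetical pairs `(M, C)` with `M` standing for $K R^t$. Common scalar factors such as $\rho'_e$ and $\rho''_e$ are dropped, which does not affect the normalised image point.

```python
# Point-line transfer: geometric construction versus formula (57).
# Assumptions (illustrative): canonical first camera (K = R = I, C = 0);
# the other cameras are hypothetical pairs (M, C) with M = K R^t.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def matvec(M, v):
    return [dot(row, v) for row in M]

def normalise(q):
    return (q[0] / q[2], q[1] / q[2])

def geometric_transfer(p, line, M1, C1, M2, C2):
    """Intersect the ray P = rho * p with the projecting plane of `line`, then project."""
    n = [sum(line[i] * M1[i][j] for i in range(3)) for j in range(3)]  # plane normal M'^t l'
    rho = dot(n, C1) / dot(n, p)       # solves l'^t M' (rho p - C') = 0, cf. (61)-(63)
    P = [rho * c for c in p]
    return normalise(matvec(M2, [P[i] - C2[i] for i in range(3)]))

def algebraic_transfer(p, line, M1, C1, M2, C2):
    """Formula (57)/(67): p'' is proportional to e'' (l'.Ap) - (l'.e') Bp."""
    Ap, Bp = matvec(M1, p), matvec(M2, p)
    e1 = [-c for c in matvec(M1, C1)]  # epipole e' up to scale
    e2 = [-c for c in matvec(M2, C2)]  # epipole e'' up to the matching scale
    return normalise([e2[j] * dot(line, Ap) - dot(line, e1) * Bp[j] for j in range(3)])

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M1, C1 = I3, [1.0, 0.5, 0.0]
M2, C2 = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [-0.1, 0.0, 1.0]], [0.0, 1.0, 0.5]

p = [0.075, -0.05, 1.0]               # image of P = (0.3, -0.2, 4) in the first view
p1 = [-0.175, -0.175, 1.0]            # its image in the second view
line = cross(p1, [0.5, -1.0, 1.0])    # an arbitrary line l' through p'

geo = geometric_transfer(p, line, M1, C1, M2, C2)
alg = algebraic_transfer(p, line, M1, C1, M2, C2)
```

Both routes return the same image point as long as $l'$ is not the epipolar line of $p$; in that degenerate case `dot(n, p)` vanishes and the ray lies inside the projecting plane.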



Remark 5. In terms of projective geometry, the geometrical construction
described here is the homography that maps the first image plane onto the third
one via the projecting plane of the line $l'$ in the second image. It is also
interesting to note that the projection $p'$ of the scene point $P$ on the line $l'$ in
the second view, which initially may be unknown, can now be calculated using
the point transfer principle formulated in Corollary 1 (when switching the roles
of $I'$ and $I''$, of course). Furthermore, it is worth mentioning that the
geometrical construction underlying Proposition 3 degenerates if the ray of sight of $p$
in the first camera lies in the projecting plane $\pi'$ of the line $l'$ for the second
camera. This situation happens when $l'$ is the epipolar line corresponding to $p$
in the second view.



4.5 The Connection Between Epipolar and Trifocal Constraints 

At first sight, there does not seem to be a direct connection between the epipolar
and trifocal constraints discussed above. The epipolar constraint gives a
relation between (the coordinates of) corresponding image points in two views,
whereas the fundamental trifocal constraint expresses a relation between
an image point in the first view and arbitrary lines through the corresponding
point in the second and in the third view. The connection, however, lies in the
word "arbitrary". Indeed, it was observed earlier that all the lines passing
through a given point in the image form a pencil of lines (i.e. a 1-parameter family
of lines) in that image, having the given point as top. In other
words, instead of considering one point in an image, it is completely equivalent to
consider all the lines through that point in the image. At the scene level, an
image point defines a ray of sight generating this point in the camera. A pencil
of lines in the image, on the other hand, generates at the scene level a pencil
(i.e. a 1-parameter family) of planes, viz. the projecting planes of the lines in
the image, which all have the ray of sight of the top of the line pencil in the image
in common. So again, considering a ray of sight in the world space is completely
equivalent to considering a pencil of projecting planes having this ray of sight
in common.

Remember that the epipolar constraint expresses algebraically
that, in order for a point $p'$ in the second image $I'$ to correspond to a point $p$
in the first image $I$ (i.e. in order for $p$ and $p'$ to be the projections of one
and the same scene point $P$), $p'$ must lie on the epipolar line corresponding
to $p$ in $I'$, and that this epipolar line actually is the projection in $I'$ of the ray
of sight generating the image point $p$ in the first camera. Put differently, the
epipolar constraint essentially expresses that in order for two image points $p \in I$
and $p' \in I'$ to be corresponding, the two rays of sight generating $p$ and $p'$ in
respectively the first and the second camera must intersect in the world space.

Now, if one interprets the pencils of lines through the image points $p'$ and
$p''$ in the trinocular case as pencils of projecting planes determining the rays of
sight of $p'$ and $p''$ in the world space, then the trifocal constraint basically expresses
that the three rays of sight defined by the three corresponding image points
must intersect in one and the same scene point. Indeed, all lines $l'$ through $p'$
in the second image sweep out all possible world planes containing the projection
center $C'$ of the second camera and the scene point $P$ of which $p$, $p'$ and $p''$
are the projections in the respective images. Similarly, all lines $l''$ through $p''$
in the third image sweep out all possible world planes containing the projection
center $C''$ of the third camera and the scene point $P$. The intersections of the
planes in the two pencils generate a family of 3D lines in the world space, all
containing $P$. The trifocal constraint, as proven earlier, states that
the projections of these lines in the first view all must contain the image point $p$;
or equivalently, that the ray of sight creating $p$ in the first camera must intersect
all the 3D lines generated by intersecting the projecting planes of the lines in
the other views in the scene point $P$. So, basically both constraints express the
same geometrical condition, albeit involving a different number of views and in
a slightly different manner. Consequently, it should be possible to express both
constraints algebraically in the same way.

Consider again the binocular case, and let $p$ and $p'$ be corresponding image
points in respectively the first and the second view. Then $p$ and $p'$ satisfy the
epipolar constraint $p'^t F p = 0$. Following the reasoning above, the point
$p'$ is the top of a pencil of lines in the second image. So, two arbitrary, but
distinct lines $l'_1$ and $l'_2$ of that pencil intersect in the image point $p'$. Recall that
$l'_1$ and $l'_2$ are represented as 3-vectors containing homogeneous coordinates for
the image lines. Their cross product $l'_1 \times l'_2$ then yields homogeneous coordinates
for the point of intersection of $l'_1$ and $l'_2$. Since $l'_1$ and $l'_2$ intersect in $p'$, it follows
that $l'_1 \times l'_2 = \rho\, p'$ for some non-zero scalar $\rho \in \mathbb{R}$. Substituting this cross product
for $p'$ in the epipolar constraint gives

$$( l'_1 \times l'_2 )^t\, F p = 0 \tag{69}$$

for all lines $l'_1$ and $l'_2$ through $p'$ in the second image. On the other hand, recall
from Theorem 1 that $F = [e']_\times A$, where $e'$ is the epipole in the second image
and $A$ is the $3 \times 3$-matrix $A = \frac{1}{\rho'_e} K' R'^t R K^{-1}$. Hence,
$F p = [e']_\times A p = e' \times A p$ and equation (69) can be written as

$$( l'_1 \times l'_2 )^t\, ( e' \times A p ) = 0 \tag{70}$$

for all lines $l'_1$ and $l'_2$ through $p'$ in the second image. Note that the left-hand side
of equation (70) is, in fact, the dot product of two cross products. From linear
algebra we know that it is equal to

$$( l'_1 \times l'_2 )^t\, ( e' \times A p ) = ( l'^t_1 e' )\, ( l'^t_2 A p ) - ( l'^t_1 A p )\, ( l'^t_2 e' ) . \tag{71}$$

As $l'^t_2 A p = ( A p )^t\, l'_2$ and $l'^t_2 e' = e'^t\, l'_2$, equation (70) can be rewritten as

$$l'^t_1 \left[\, e'\, ( A p )^t - ( A p )\, e'^t \,\right] l'_2 = 0 \tag{72}$$

for all lines $l'_1$ and $l'_2$ through $p'$ in the second image. Observe that the middle
matrix in the left-hand side of equation (72) is, up to sign, exactly the $3 \times 3$-matrix
$[Tp]$ in the fundamental trifocal constraint of Theorem 2,
provided the second and the third image are identical; i.e. $I' = I''$, $e' = e''$ and
$A = B$.

Proposition 4. [2] Let $I$ and $I'$ be two views of a static scene. If $p$ and $p'$
are corresponding points in the images $I$ and $I'$ respectively, then the following
relation is equivalent to the epipolar constraint for $p$ and $p'$: for every two
lines $l'_1$ and $l'_2$ through $p'$ in $I'$,

$$l'^t_1\, [Tp]\, l'_2 = 0 , \tag{73}$$

where $T$ is the trifocal tensor of the image triple $(I, I', I')$, as defined in
Theorem 2. □
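
The identity behind equations (70) to (72) is easy to check on numbers. The fragment below is a small illustrative sketch; the vectors used for $e'$, $Ap$ and $p'$ are made-up values chosen so that $p'$ lies on the epipolar line $e' \times Ap$, and the function names are hypothetical.

```python
# Epipolar constraint rewritten as a trifocal-type constraint, equations (69)-(72).
# Illustrative values only: A is taken to be the identity, so A p = p.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def epipolar_form(l1, l2, e, Ap):
    """Left-hand side of (70): (l1 x l2) . (e' x A p)."""
    return dot(cross(l1, l2), cross(e, Ap))

def trifocal_form(l1, l2, e, Ap):
    """Left-hand side of (72): l1^t [ e' (A p)^t - (A p) e'^t ] l2."""
    return dot(l1, e) * dot(Ap, l2) - dot(l1, Ap) * dot(e, l2)

e = [-1.0, -0.5, 0.0]            # epipole e' in the second image
Ap = [0.075, -0.05, 1.0]         # A p for the point p = (0.075, -0.05, 1)
p1 = [-0.175, -0.175, 1.0]       # corresponding point p' (lies on e' x A p)

l1 = cross(p1, [0.5, -1.0, 1.0])     # two arbitrary lines through p'
l2 = cross(p1, [2.0, 3.0, 1.0])
on_point = (epipolar_form(l1, l2, e, Ap), trifocal_form(l1, l2, e, Ap))

m2 = cross([0.4, 0.1, 1.0], [2.0, 3.0, 1.0])   # a line that does not pass through p'
off_point = (epipolar_form(l1, m2, e, Ap), trifocal_form(l1, m2, e, Ap))
```

For two lines through $p'$ both expressions vanish (up to rounding); for a line that misses $p'$ they remain equal to each other, as identity (71) requires, but are no longer zero.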



In other words, by replacing an image point by (the intersection of) two
arbitrary lines through that point in the image, the epipolar constraint can be
written as a trifocal constraint in which the second and the third view
coincide; and, vice versa, by identifying the second and the third view in the trifocal
constraint the epipolar geometry of the image pair is recovered. In [1] (see
also [2]) explicit formulas are given on how the trifocal tensor of an image triple
changes when the third camera is moved in the world. This makes it possible to generate
novel views of the scene from three given ones by using the trifocal constraints
as a transfer principle. Moreover, these observations are used in [2] as a common
framework, based on the trifocal tensor, for describing the multiview relations
in image sequences.



5 Relations Between Four Views: The Quadrifocal 
Constraints 

The methodology developed in the previous section to connect the epipolar
constraint with the trifocal one can bring us on the track of the quadrifocal
constraints that must hold between four views of a static scene. The fundamental
trifocal constraint expresses a relation between an image point in the first
view and arbitrary lines through the corresponding point in the second and in
the third view. Following the methodology of section 4.5, the image point $p$
in the first view can be interpreted as the top of a pencil of lines through $p$
in the first image. Algebraically, the point $p$ in the fundamental trifocal
constraint can be replaced by the cross product $l_1 \times l_2$ of two arbitrary lines $l_1$
and $l_2$ passing through $p$ in the first view. In this way, an algebraic relation is
obtained between homogeneous coordinates of four arbitrary lines passing through
corresponding points in the images. This relation is a degenerate case of the
fundamental quadrifocal constraint (74), which occurs if one identifies the first
and the second of the four views in Theorem 3. Hence, the fundamental
quadrifocal constraint is a relation between (homogeneous coordinates of) four lines in
the respective images.

But what does this relation express geometrically?
Recall that the fundamental trifocal constraint expresses
algebraically that the projecting planes $\pi'$ and $\pi''$ of the lines $l'$ and $l''$ in
respectively the second and the third view intersect in a 3D line $L$ in the world space,
which contains the scene point $P$ of which the corresponding image points $p$, $p'$
and $p''$ are the projections in the respective views; or equivalently, that the ray of
sight observing the image point $p$ in the first camera must intersect the 3D line
$L$ (consequently, in the scene point $P$). Now, if the image point $p$ is replaced
by a pencil of lines through $p$ in the first view, then each line of this pencil
defines a projecting plane $\pi$ in the world space, containing the ray of sight of $p$
in the first camera. Since the ray of sight of $p$ in the first camera is just the 3D
line connecting the projection center $C$ of the first camera with the scene point
$P$, each such projecting plane $\pi$ must contain both $C$ and $P$. So, two arbitrary
lines $l_1$ and $l_2$ passing through $p$ in the first view generate two world planes
$\pi_1$ and $\pi_2$, both containing the projection center $C$ of the first camera and the
scene point $P$ underlying $p$ in the first image.

Imagine that the projection center
$C$ is split into two space points $C_1$ and $C_2$, which are then moved apart. In other
words, assume that the first camera is "decoupled" into two separate cameras,
which are then moved apart. Then the following situation occurs: there are four
cameras, and thus four images, with projection centers $C_1$, $C_2$, $C'$ and $C''$
respectively. And, as before, the lines $l'$ and $l''$ in the images obtained by the
cameras at positions $C'$ and $C''$ still define, through their projecting planes, a
3D line $L$ in the world space, which contains the scene point $P$. The line $l_2$ in the
image obtained from position $C_2$ generates a projecting plane $\pi_2$ in the world
space, which intersects the 3D line $L$ in $P$. This situation is generic for every
triple of lines through corresponding image points in three views and does not
yield a constraint. The fourth view, however, introduces a constraint, because the
line $l_1$ in the image obtained from position $C_1$ will contain the projection of the
scene point $P$ in this view if and only if the projecting plane $\pi_1$ of $l_1$ contains the
3D point $P$ defined as the intersection of the (essentially arbitrary) projecting
planes $\pi_2$, $\pi'$ and $\pi''$ in the world. The fundamental quadrifocal constraint (74)
expresses this constraint algebraically in terms of (homogeneous coordinates of)
the lines in the images.

Theorem 3. (Fundamental quadrifocal constraint) For every
four views $I$, $I'$, $I''$ and $I'''$ of a static scene, there exists a $3 \times 3 \times 3 \times 3$-tensor
$Q = ( Q_{ijk\ell} )_{1 \le i,j,k,\ell \le 3}$, called the quadrifocal tensor of the image quadruple
$(I, I', I'', I''')$, with the following property: If $p \in I$, $p' \in I'$, $p'' \in I''$ and
$p''' \in I'''$ are corresponding points in the images, then for every line $l$ through
$p$ in $I$, for every line $l'$ through $p'$ in $I'$, for every line $l''$ through $p''$ in $I''$ and
for every line $l'''$ through $p'''$ in $I'''$,

$$\sum_{i=1}^{3} \sum_{j=1}^{3} \sum_{k=1}^{3} \sum_{\ell=1}^{3} Q_{ijk\ell}\; l_i\, l'_j\, l''_k\, l'''_\ell = 0 , \tag{74}$$

where $l^{(a)} \sim ( l^{(a)}_1, l^{(a)}_2, l^{(a)}_3 )^t$ are homogeneous coordinates for the lines $l^{(a)}$
($0 \le a \le 3$) in the images, the superscript $(a)$ indicating the number of primes.
Moreover, if, with the notations as before,
$A = \frac{1}{\rho'_e} K' R'^t R K^{-1}$, $B = \frac{1}{\rho''_e} K'' R''^t R K^{-1}$
and $C = \frac{1}{\rho'''_e} K''' R'''^t R K^{-1}$, then

$$Q_{ijk\ell} = \det \begin{pmatrix} \delta_{i1} & \delta_{i2} & \delta_{i3} & 0 \\ a_{j1} & a_{j2} & a_{j3} & (e')_j \\ b_{k1} & b_{k2} & b_{k3} & (e'')_k \\ c_{\ell 1} & c_{\ell 2} & c_{\ell 3} & (e''')_\ell \end{pmatrix} \quad \text{for } 1 \le i, j, k, \ell \le 3 , \tag{75}$$

where $\delta_{ij}$ denotes the Kronecker delta; $a_{ij}$, $b_{ij}$ and $c_{ij}$ denote the $(i,j)$th entry of
the matrices $A$, $B$ and $C$ respectively; and with $(e')_j$, $(e'')_j$ and $(e''')_j$ being the
$j$th coordinate of the epipoles $e'$, $e''$ and $e'''$ of the first camera in respectively
the second, third and fourth view.



Proof. We just have to express that the four projecting planes of the image lines
$l$, $l'$, $l''$ and $l'''$ have the scene point $P$, projecting onto the corresponding image
points $p$, $p'$, $p''$ and $p'''$, in common. By Lemma 1 this means that $P$ satisfies
the following system of equations:

$$\begin{cases} l^t K R^t ( P - C ) = 0 \\ l'^t K' R'^t ( P - C' ) = 0 \\ l''^t K'' R''^t ( P - C'' ) = 0 \\ l'''^t K''' R'''^t ( P - C''' ) = 0 \end{cases} \tag{76}$$

Or, expressed in matrix form,

$$\begin{pmatrix} l^t K R^t & -l^t K R^t C \\ l'^t K' R'^t & -l'^t K' R'^t C' \\ l''^t K'' R''^t & -l''^t K'' R''^t C'' \\ l'''^t K''' R'''^t & -l'''^t K''' R'''^t C''' \end{pmatrix} \begin{pmatrix} P \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} . \tag{77}$$

As this homogeneous system of linear equations clearly has a non-zero solution,
the leftmost $4 \times 4$-matrix must be singular. Because the rank of a matrix does
not change if the matrix is multiplied with a regular matrix, the matrix

$$\begin{pmatrix} l^t K R^t & -l^t K R^t C \\ l'^t K' R'^t & -l'^t K' R'^t C' \\ l''^t K'' R''^t & -l''^t K'' R''^t C'' \\ l'''^t K''' R'''^t & -l'''^t K''' R'''^t C''' \end{pmatrix} \begin{pmatrix} R K^{-1} & C \\ 0^t & 1 \end{pmatrix} = \begin{pmatrix} l^t & 0 \\ \rho'_e\, l'^t A & \rho'_e\, l'^t e' \\ \rho''_e\, l''^t B & \rho''_e\, l''^t e'' \\ \rho'''_e\, l'''^t C & \rho'''_e\, l'''^t e''' \end{pmatrix} \tag{78}$$

must be singular as well. In particular, the determinant of matrix (78) must be
zero. After expansion and division by the non-zero factor $\rho'_e\, \rho''_e\, \rho'''_e$, this
determinant is the quadrifocal constraint (74) of the theorem. As the determinant yields
a quadrilinear expression in the entries of $l$, $l'$, $l''$ and $l'''$, the entries of the
quadrifocal tensor $Q$ can be found by substituting the appropriate standard unit
3-vector $u_1 = (1, 0, 0)^t$, $u_2 = (0, 1, 0)^t$ or $u_3 = (0, 0, 1)^t$ for $l$, $l'$, $l''$ and $l'''$ in the
determinant. More precisely, $Q_{ijk\ell}$ is found by taking $l = u_i$, $l' = u_j$, $l'' = u_k$,
$l''' = u_\ell$. For every 3-rowed matrix $M$, $u_i^t M$ is equal to the $i$th row of $M$. Hence,
$Q_{ijk\ell}$ equals the determinant of the $4 \times 4$-matrix whose rows respectively are the
$i$th row of the $3 \times 4$-matrix $( I_3 \mid 0 )$, the $j$th row of the matrix $( A \mid e' )$, the
$k$th row of $( B \mid e'' )$ and the $\ell$th row of $( C \mid e''' )$, as in formula (75). □
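
The determinant description of $Q_{ijk\ell}$ lends itself to a direct numerical check of the constraint (74). The following Python fragment is a minimal sketch under assumptions that are not part of the text: the first camera is canonical ($K = R = I$, $C = 0$), the other cameras are hypothetical pairs `(M, C)` with `M` standing for $K R^t$, and the rows of $(M \mid -MC)$ are used in place of $(A \mid e')$ etc., which only rescales $Q$ by the non-zero factor $\rho'_e \rho''_e \rho'''_e$.

```python
# Quadrifocal tensor as a 4 x 4 determinant (formula (75)) and the constraint (74).
# Assumptions (illustrative): canonical first camera; other cameras are (M, C), M = K R^t.

def det(M):
    """Determinant by Laplace expansion along the first row (adequate for 4 x 4)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def camera_matrix(M, C):
    """3 x 4 matrix (M | -M C)."""
    return [list(M[i]) + [-sum(M[i][j] * C[j] for j in range(3))] for i in range(3)]

def project(M, C, P):
    q = [sum(M[i][j] * (P[j] - C[j]) for j in range(3)) for i in range(3)]
    return (q[0] / q[2], q[1] / q[2])

def quadrifocal_tensor(cams):
    """Q[i][j][k][l]: determinant of one row taken from each camera matrix."""
    P0, P1, P2, P3 = (camera_matrix(M, C) for M, C in cams)
    return [[[[det([P0[i], P1[j], P2[k], P3[l]]) for l in range(3)]
              for k in range(3)] for j in range(3)] for i in range(3)]

def constraint(Q, l0, l1, l2, l3):
    """Left-hand side of the quadrifocal constraint (74)."""
    return sum(Q[i][j][k][l] * l0[i] * l1[j] * l2[k] * l3[l]
               for i in range(3) for j in range(3)
               for k in range(3) for l in range(3))

def vertical(x):      # vertical image line through a point with abscissa x
    return [1.0, 0.0, -x]

def horizontal(y):    # horizontal image line through a point with ordinate y
    return [0.0, -1.0, y]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M2 = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [-0.1, 0.0, 1.0]]
cams = [(I3, [0.0, 0.0, 0.0]), (I3, [1.0, 0.5, 0.0]),
        (M2, [0.0, 1.0, 0.5]), (I3, [-1.0, 0.0, 1.0])]
P = [0.3, -0.2, 4.0]
pts = [project(M, C, P) for M, C in cams]          # corresponding image points
Q = quadrifocal_tensor(cams)

lines = [vertical(pts[0][0]), horizontal(pts[1][1]), vertical(pts[2][0])]
through = constraint(Q, *lines, horizontal(pts[3][1]))        # line through p'''
missing = constraint(Q, *lines, horizontal(pts[3][1] + 0.5))  # line that misses p'''
```

Here `through` vanishes up to rounding, while `missing` does not, in line with the statement that the fourth view is the one that introduces a constraint.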

As in section 4.3, the fundamental quadrifocal constraint (74) can be
transformed into incidence relations involving the coordinates of one or more of the
image points $p$, $p'$, $p''$ or $p'''$ by taking for $l$, $l'$, $l''$ and / or $l'''$ the horizontal and
vertical lines through the given image point(s). Table 2 gives an overview of
the different quadrifocal relations ordered by the number and type of geometric
image features involved. Again, the "number of equations" refers to the number
of linearly independent equations that exist for the homogeneous coordinates
of the image features involved in that particular type of constraint. It is
important to note that, in the cases involving both points and lines, different types
of relations are possible, depending on which particular view(s) the point(s) are
taken from. As before, this variability is not taken into account in the "number
of equations".

    image features             no. of equations
    four points                16
    three points, one line     8
    two points, two lines      4
    one point, three lines     2
    four lines                 1

Table 2. Overview of the different types of quadrifocal constraints.

Remark 6. As with the epipolar and the trifocal constraints, if corresponding
image features are identified in the four views, then the relevant incidence
relations mentioned in Table 2 all bring homogeneous linear equations in the entries
of the quadrifocal tensor $Q$. Hence, $Q$ can be computed linearly, up to
a non-zero scalar factor, from these equations provided sufficient corresponding
points and lines can be identified in the four views. More precisely, the
quadrifocal constraints for corresponding image points yield 16 linearly independent
equations in the entries of $Q$ for each quadruple of corresponding points in $I$,
$I'$, $I''$ and $I'''$. Because $Q$ has $3^4 = 81$ entries which can be determined up to
a non-zero scalar multiple only, one would expect that 5 point correspondences
identified in the four views would suffice to compute $Q$ linearly,
up to a non-zero scalar factor, because 5 point correspondences yield 80
homogeneous linear equations in the entries of $Q$. Unfortunately, it has been proven
that for $n \le 5$ point correspondences between the four views the resulting homogeneous
system of linear equations in the entries of $Q$ has rank $16\,n - \binom{n}{2}$, where $\binom{n}{2}$ is
the binomial coefficient $\binom{n}{2} = \frac{n(n-1)}{2}$. Thus, 5 point correspondences yield only
70 linearly independent equations, which is not enough to solve for the entries of
$Q$. For $n \ge 6$, on the other hand, 80 linearly independent equations are found.
Consequently, $Q$ can be computed linearly, up to a non-zero scalar factor, from
(at least) 6 point correspondences between the four images. Furthermore, it
has been proven that the quadrifocal tensor $Q$ has rank 9. In fact, the entries
of $Q$ satisfy 51 non-linear algebraic relations, in addition to the scale
ambiguity, such that $Q$ actually only has 29 degrees of freedom. Due to the
presence of noise in the images, the tensor $Q$ computed linearly from point
correspondences between the images most certainly will not satisfy these non-linear
relations. Moreover, it is clear that one cannot ignore the 51 non-linear relations
in the 81 entries of $Q$ and hope to get reasonable results. Imposing these relations
in the computation of $Q$ results in non-linear criteria. Practical and accurate
algorithms for the computation of $Q$ can be found in the literature.
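
The counting in Remark 6 can be tabulated in a few lines. The sketch below merely restates the numbers quoted in the remark (the rank formula for $n \le 5$, saturation at 80 for $n \ge 6$, and the 51 non-linear relations); it does not derive them, and the function and variable names are hypothetical.

```python
# Counting argument of Remark 6 (numbers quoted from the text, not re-derived).

def rank_from_points(n):
    """Rank of the linear system for Q obtained from n point correspondences."""
    if n <= 5:
        return 16 * n - n * (n - 1) // 2     # 16 n - binomial(n, 2)
    return 80                                # saturates for n >= 6

UNKNOWNS_UP_TO_SCALE = 3 ** 4 - 1            # 81 entries, minus the scale factor

ranks = {n: rank_from_points(n) for n in range(1, 8)}
minimal_n = min(n for n, r in ranks.items() if r >= UNKNOWNS_UP_TO_SCALE)
degrees_of_freedom = 3 ** 4 - 1 - 51         # scale plus 51 non-linear relations
```

With these definitions, `ranks[5]` is 70, `minimal_n` is 6 and `degrees_of_freedom` is 29, matching the figures stated in the remark.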

The quadrifocal constraints mentioned in Table 2 express geometric incidence
relations between the image features in the four views. As in the trifocal
case, these quadrifocal constraints can also be converted to transfer equations,
predicting the position of an image feature in one view from the positions of
related features in the other views. One case is worth mentioning here. Recall
from the beginning of this section that three arbitrary lines in three views
generate three projecting planes in the world space, which generally intersect in a
single world point $P$, as depicted in Figure 7. The fundamental quadrifocal
constraint (74) gives a necessary and sufficient condition for the projecting plane of
a line in the fourth view to contain that particular world point $P$ as well.
Obviously, the projections of $P$ in the different views are the corresponding image
points mentioned in Theorem 3. Replacing the line $l'''$ in the fourth image by
the horizontal and the vertical line through the point $p'''$ in the fourth view not
only yields the 1 point / 3 lines constraints in Table 2, but also gives a means to
compute the position of $p'''$ in the fourth view.

Fig. 7. Three lines $l$, $l'$ and $l''$ in three images, all containing the (possibly
unknown) projections of a scene point $P$, suffice to compute the projection $p'''$ of
$P$ in the fourth image.



Proposition 5. (3 lines transfer) Let $I$, $I'$, $I''$ and $I'''$ be four views of a static
scene. If $l$, $l'$ and $l''$ are arbitrary lines in the images $I$, $I'$ and $I''$ respectively,
containing the (possibly unknown) projections $p$, $p'$ and $p''$ of a scene point $P$,
then the (extended) coordinates of the projection $p'''$ of $P$ in the fourth view $I'''$
are (up to a non-zero scalar factor) given by

$$\tau'''\, p''' = \left[\, l^t\, [Q^{(k\ell)}]\, l' \,\right]^t l'' , \tag{79}$$

where $[Q^{(k\ell)}]$ is the $3 \times 3$-matrix whose $(i,j)$th entry is $Q_{ijk\ell}$,
$[\, l^t [Q^{(k\ell)}] l' \,]$ is the $3 \times 3$-matrix whose $(k,\ell)$th entry is $l^t [Q^{(k\ell)}] l'$,
and with $\tau''' \in \mathbb{R}$ being a non-zero scalar factor.

Proof. Before proving the proposition, let us first rewrite the fundamental
quadrifocal constraint (74) in a more compact form. Let, for a moment, the indices $k$
and $\ell$ be arbitrary, but fixed. Then all entries $Q_{ijk\ell}$ of the quadrifocal tensor $Q$
obtained by varying the indices $1 \le i, j \le 3$, but keeping $k$ and $\ell$ fixed, define a
$3 \times 3$-matrix which we will denote by $[Q^{(k\ell)}]$. If $l$ and $l'$ are two lines in
respectively the images $I$ and $I'$, then $l^t [Q^{(k\ell)}] l'$ is a real number, depending on the
choice of the indices $k$ and $\ell$. When varying the indices $1 \le k, \ell \le 3$, one again
obtains a $3 \times 3$-matrix, which we will denote by $[\, l^t [Q^{(k\ell)}] l' \,]$. The fundamental
quadrifocal constraint (74) now is equal to

$$l''^t \left[\, l^t\, [Q^{(k\ell)}]\, l' \,\right] l''' = 0 . \tag{80}$$

Now, consider three lines $l$, $l'$ and $l''$ in the images $I$, $I'$ and $I''$ respectively.
Furthermore, let $p'''$ be the projection in the fourth view $I'''$ of the point of
intersection of the projecting planes of $l$, $l'$ and $l''$ in the world space. Then
the (extended) coordinates $p''' = (x''', y''', 1)^t$ of $p'''$ can be obtained from
formula (80) by replacing the arbitrary line $l'''$ in $I'''$ by respectively the horizontal
line $h''' \sim (0, -1, y''')^t$ and the vertical line $v''' \sim (1, 0, -x''')^t$ through $p'''$ in
$I'''$:

$$\begin{cases} l''^t \left[\, l^t [Q^{(k\ell)}] l' \,\right] h''' = 0 \\ l''^t \left[\, l^t [Q^{(k\ell)}] l' \,\right] v''' = 0 \end{cases} \tag{81}$$

Following the same argument as in Proposition 3, the system (81) can be
interpreted as stating that the 3-vector

$$\left( l''^t \left[\, l^t [Q^{(k\ell)}] l' \,\right] \right)^t = \left[\, l^t [Q^{(k\ell)}] l' \,\right]^t l'' \tag{82}$$

contains homogeneous coordinates of an image point that lies on both lines $h'''$
and $v'''$. Since, by construction, $p'''$ is the point of intersection of $h'''$ and $v'''$,
the proposition follows. □
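
Formula (79) can be tried out on synthetic data. The sketch below is illustrative and makes assumptions that are not in the text: the first camera is canonical ($K = R = I$, $C = 0$), the other cameras are hypothetical pairs `(M, C)` with `M` standing for $K R^t$, and $Q$ is computed from the rows of $(M \mid -MC)$, i.e. up to a non-zero scalar factor, which does not affect the normalised point $p'''$.

```python
# Three-line transfer, formula (79): p''' ~ [ l^t [Q^(kl)] l' ]^t l''.
# Assumptions (illustrative): canonical first camera; other cameras are (M, C), M = K R^t.

def det(M):
    """Determinant by Laplace expansion along the first row (adequate for 4 x 4)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def camera_matrix(M, C):
    return [list(M[i]) + [-sum(M[i][j] * C[j] for j in range(3))] for i in range(3)]

def project(M, C, P):
    q = [sum(M[i][j] * (P[j] - C[j]) for j in range(3)) for i in range(3)]
    return (q[0] / q[2], q[1] / q[2])

def quadrifocal_tensor(cams):
    P0, P1, P2, P3 = (camera_matrix(M, C) for M, C in cams)
    return [[[[det([P0[i], P1[j], P2[k], P3[l]]) for l in range(3)]
              for k in range(3)] for j in range(3)] for i in range(3)]

def transfer_three_lines(Q, l0, l1, l2):
    """Return (x''', y''') from lines l, l', l'' in the first three views."""
    # G[k][l] is the (k, l)th entry l^t [Q^(kl)] l' of the matrix in (79).
    G = [[sum(Q[i][j][k][l] * l0[i] * l1[j] for i in range(3) for j in range(3))
          for l in range(3)] for k in range(3)]
    w = [sum(G[k][l] * l2[k] for k in range(3)) for l in range(3)]   # G^t l''
    if abs(w[2]) < 1e-12:
        raise ValueError("degenerate configuration (e.g. coinciding projecting planes)")
    return (w[0] / w[2], w[1] / w[2])

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M2 = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [-0.1, 0.0, 1.0]]
cams = [(I3, [0.0, 0.0, 0.0]), (I3, [1.0, 0.5, 0.0]),
        (M2, [0.0, 1.0, 0.5]), (I3, [-1.0, 0.0, 1.0])]
P = [0.3, -0.2, 4.0]
pts = [project(M, C, P) for M, C in cams]

l0 = [1.0, 0.0, -pts[0][0]]     # vertical line through p
l1 = [0.0, -1.0, pts[1][1]]     # horizontal line through p'
l2 = [1.0, 0.0, -pts[2][0]]     # vertical line through p''
transferred = transfer_three_lines(quadrifocal_tensor(cams), l0, l1, l2)
expected = pts[3]               # direct projection of P in the fourth view
```

The three lines only need to contain the projections of $P$; the points themselves never enter `transfer_three_lines`, which is exactly the situation described in Proposition 5.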



Remark 7. Observe that the geometrical construction underlying Proposition 5
degenerates if (at least) two of the three projecting planes coincide. This
situation happens when (at least) two of the three lines $l$, $l'$ and $l''$ are corresponding
epipolar lines in their images. Also note that, when $p'''$ is known, the
corresponding points $p$, $p'$ and $p''$ in the other images can be found by the point-line
transfer principle described in Proposition 3. Finally, the attentive reader will
remark that the projection $p$ of the scene point $P$ in the first image can also
be obtained as the point of intersection of the image line $l$ with the projection
in $I$ of the 3D line defined by the intersection of the projecting plane of $l'$ and
that of $l''$ in the world space, as given by the corresponding formula in the earlier
proposition on line transfer. The corresponding point $p'''$ in the fourth view can then
be found by the point-line transfer principle described in Proposition 3. Observe,
however, that this latter construction needs the estimation of two trifocal tensors
between different image triples. Proposition 5 has the advantage of providing an
explicit formula (79) and an independent construction of $p'''$.

6 Relations Between n Views: The General Picture

What about the relations between five views? Clearly, the "decoupling trick"
of section 5 does not work anymore, because the fundamental quadrifocal
constraint (74) is a relation between image lines and does not (explicitly) involve
image points anymore. It will be shown in this section that for
five or more views of a static scene there essentially are no other relations than
the ones discussed before (i.e. the epipolar, trifocal and quadrifocal constraints
between every combination of two, three or four views out of the given ones).

6.1 Relations Between n ≥ 5 Views

Let us first consider the n-view constraints for (the extended coordinates of)
corresponding image points. Hence, let $I^{(0)}, I^{(1)}, \ldots, I^{(n-1)}$ be $n$ views of a
static scene. You may think of the superscript $(i)$ as indicating the number of
primes used in the previous sections. Furthermore, suppose that $p^{(0)} \in I^{(0)}$,
$p^{(1)} \in I^{(1)}$, ..., $p^{(n-1)} \in I^{(n-1)}$ are corresponding points in the respective
images, meaning that they are all the projection of a single scene point $P$ in the
different views. In particular,

$$\rho^{(i)}\, p^{(i)} = K_i R_i^t\, ( P - C_i ) \quad \text{for all } 0 \le i \le n-1 , \tag{83}$$

where the world point $C_i$ and the $3 \times 3$-rotation matrix $R_i$ indicate the position
and orientation of the $(i+1)$th camera in the scene, $K_i$ is its calibration matrix,
and the $\rho^{(i)} \in \mathbb{R}$ are non-zero scalar factors. Or, equivalently,

$$K_i R_i^t\, P - K_i R_i^t\, C_i - \rho^{(i)}\, p^{(i)} = 0 \quad \text{for all } 0 \le i \le n-1 . \tag{84}$$

All these projection equations can be expressed in matrix form as

$$\begin{pmatrix} K_0 R_0^t & -K_0 R_0^t C_0 & p^{(0)} & 0 & \cdots & 0 \\ K_1 R_1^t & -K_1 R_1^t C_1 & 0 & p^{(1)} & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ K_{n-1} R_{n-1}^t & -K_{n-1} R_{n-1}^t C_{n-1} & 0 & 0 & \cdots & p^{(n-1)} \end{pmatrix} \begin{pmatrix} P \\ 1 \\ -\rho^{(0)} \\ -\rho^{(1)} \\ \vdots \\ -\rho^{(n-1)} \end{pmatrix} = 0 . \tag{85}$$

Observe that the column vector at the left-hand side of the matrix equation (85)
contains the variable parameters whereas the leftmost matrix contains all the
camera and image information. So, one may consider this matrix equation as
representing a homogeneous system of linear equations from which the world
coordinates of the scene point $P$ and the unknown scalars $\rho^{(0)}, \rho^{(1)}, \ldots, \rho^{(n-1)}$
can be determined if the camera parameters and the projections $p^{(0)}, p^{(1)}, \ldots,
p^{(n-1)}$ of $P$ in the respective images are given. The entry equal to 1 in the
column vector proves that this homogeneous system of linear equations has a
non-zero solution; or equivalently, that the rank of the (leftmost) matrix

$$\begin{pmatrix} K_0 R_0^t & -K_0 R_0^t C_0 & p^{(0)} & 0 & \cdots & 0 \\ K_1 R_1^t & -K_1 R_1^t C_1 & 0 & p^{(1)} & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ K_{n-1} R_{n-1}^t & -K_{n-1} R_{n-1}^t C_{n-1} & 0 & 0 & \cdots & p^{(n-1)} \end{pmatrix} \tag{86}$$
is strictly less than the number of columns of that matrix. As Ki and Ri are 
3 X 3-matrices, and C^, and 0 are 3-vectors, the matrix (ESJ has 3 n rows 
and n + A columns. Thus, this rank condition on matrix m only yields a 
constraint if the number of rows is greater than the number of columns; i.e. 
3n > n -|- 4, or equivalently, if n > 2. In other words, there only is a constraint 
on the position of corresponding image points if there are at least 2 views of the 
scene; an observation that was already made in section 0 

Suppose that n> 2. Then the rank condition on matrix implies that all 
(n + A) X (n + 4)-submatrices of this matrix must be singular; or equivalently, 
that all (n-|-4) x (n-|-4)-submatrices of matrix IjiStijl must have zero determinant. 
Obviously, each (n -|- 4) x (n -|- 4)-submatrix is obtained by choosing n + A rows 
among the 3 n rows of matrix (EEJ; or equivalently, by deleting 3 n — (n -|- 4) = 
2n — 4 rows from matrix In this matrix each view is represented by 3 
rows, namely 

( K,R\ - K,R\C, 0 ... 0 0 ... 0 ) . (87) 

So, if none of the three rows corresponding to the $(i+1)$th view $I^{(i)}$ is contained
in the $(n+4) \times (n+4)$-submatrix, then the $(4+i+1)$th column of that
submatrix is zero; and, its determinant trivially becomes zero. Hence,
only $(n+4) \times (n+4)$-submatrices containing at least 1 row from each view yield
a non-trivial constraint between the coordinates of corresponding image points.
On the other hand, if an $(n+4) \times (n+4)$-submatrix of matrix (86) contains
exactly one row corresponding to image $I^{(i)}$, then the $(4+i+1)$th column of
this submatrix contains only one non-zero entry. Developing the determinant
of this submatrix along this $(4+i+1)$th column results in a relation between
corresponding image points in the views $I^{(0)}, \ldots, I^{(i-1)}, I^{(i+1)}, \ldots, I^{(n-1)}$;
and, one ends up with a constraint between only $n-1$ of the $n$ views. So, view
$I^{(i)}$ will participate in a multiview constraint if and only if at least two rows of the
submatrix (87) are contained in the $(n+4) \times (n+4)$-submatrix of matrix (86).
Put differently, in order for the view $I^{(i)}$ to participate in a multiview constraint,
at most 1 row of the submatrix (87) may be deleted from matrix (86) to form an
$(n+4) \times (n+4)$-submatrix. Since $3n$ increases more rapidly with $n$ than $n+4$, the
number of rows in matrix (86) increases more rapidly with the number of views
than the number of columns. Consequently, there must be a minimal number of
views $n^*$ for which the previous condition (i.e. deleting at most 1 row for each
view) cannot be maintained for all of the $n^*$ views. This situation occurs
if the number of rows to be deleted from matrix (86) (i.e. $2n^* - 4$) is greater
than the number of views involved; i.e. $2n^* - 4 > n^*$, or equivalently, $n^* > 4$.
In other words, a non-trivial constraint involving image points from all the $n$
views $I^{(0)}, \ldots, I^{(n-1)}$ only exists if $n \leq 4$. Put differently, all the multiview
relations that exist between the image coordinates of corresponding image points
in 5 or more views of a static scene are expressed by the epipolar constraints
between any pair, the trifocal constraints between any triplet, and the quadrifocal
constraints between any quadruple of views among the given images.
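
The rank condition underlying this counting argument can be illustrated numerically. The following is a minimal sketch, not part of the original derivation: the three integer camera matrices and the scene point are hypothetical, and each camera is given directly as a $3 \times 4$ matrix standing for $( K_i R_i^t \mid -K_i R_i^t C_i )$. It stacks the blocks (87) and checks that $(P, 1, -\rho_0, \ldots, -\rho_{n-1})$ is a non-zero solution of the system, so that the matrix cannot have full column rank.

```python
from fractions import Fraction

def project(M, X):
    """Return (rho, p) with rho * p = M X and p = (x, y, 1)."""
    v = [sum(Fraction(M[r][c]) * X[c] for c in range(4)) for r in range(3)]
    rho = v[2]
    return rho, [c / rho for c in v]

def multiview_matrix(cameras, points):
    """Stack one 3-row block (M_i | 0 ... p_i ... 0) per view."""
    n = len(cameras)
    rows = []
    for i, (M, p) in enumerate(zip(cameras, points)):
        for r in range(3):
            tail = [Fraction(0)] * n
            tail[i] = p[r]
            rows.append([Fraction(c) for c in M[r]] + tail)
    return rows

# Hypothetical camera matrices (assumed values, chosen only for illustration).
cameras = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[1, 0, 1, -2], [0, 2, 0, 1], [0, 0, 1, 3]],
    [[2, 1, 0, 1], [0, 1, 1, -1], [1, 0, 1, 4]],
]
X = [Fraction(1), Fraction(2), Fraction(5), Fraction(1)]   # scene point (P, 1)

projections = [project(M, X) for M in cameras]
matrix = multiview_matrix(cameras, [p for _, p in projections])
null_vector = X + [-rho for rho, _ in projections]         # (P, 1, -rho_0, ...)
residual = [sum(a * b for a, b in zip(row, null_vector)) for row in matrix]
```

Here `matrix` has $3n = 9$ rows and $n + 4 = 7$ columns, and every entry of `residual` is exactly zero, which is the rank deficiency exploited above.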

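Next, let us investigate what happens when image lines are involved as well. A
line $l^{(i)}$ in image $I^{(i)}$ passes through the image point $p^{(i)}$ in $I^{(i)}$ if and only if
the projecting plane generating the line $l^{(i)}$ in the $(i+1)$th camera contains the
scene point $P$ of which $p^{(i)}$ is the projection in $I^{(i)}$. According to Lemma 1, $P$
must therefore satisfy the equation

\[ l^{(i)t} K_i R_i^t \, ( P - C_i ) = 0 ; \tag{88} \]

or equivalently,

\[ l^{(i)t} K_i R_i^t \, P - l^{(i)t} K_i R_i^t C_i = 0 . \tag{89} \]

Writing this in matrix form in conformity with matrix equation (85) gives

\[ ( \; l^{(i)t} K_i R_i^t \quad -l^{(i)t} K_i R_i^t C_i \quad 0 \; \ldots \; 0 \; )
   \begin{pmatrix} P \\ 1 \\ -\rho_0 \\ \vdots \\ -\rho_{n-1} \end{pmatrix} = 0 . \tag{90} \]

Hence, if in image $I^{(i)}$ a line $l^{(i)}$ through the point $p^{(i)}$ is used instead of the point
$p^{(i)}$ itself, the 3 rows (87) in matrix (86) are replaced by only 1 row, viz.

\[ ( \; l^{(i)t} K_i R_i^t \quad -l^{(i)t} K_i R_i^t C_i \quad 0 \; \ldots \; 0 \; ) . \tag{91} \]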

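Let us express this more precisely. Suppose that in $m$ of the $n$ views
$I^{(0)}, \ldots, I^{(n-1)}$ corresponding image points $p^{(i)} \in I^{(i)}$ will be used, whereas in the
other $k = n - m$ views image lines $l^{(j)} \in I^{(j)}$ through the corresponding point
$p^{(j)}$ are being considered. Without loss of generality, we may assume that the
points $p^{(i)}$ are taken from the first $m$ views $I^{(0)}, \ldots, I^{(m-1)}$; and, that
the lines $l^{(j)}$ are taken from the last $k$ images $I^{(m)}, \ldots, I^{(n-1)}$. Then
the projection equations yield a homogeneous system of linear equations (as in
formula (85)) whose matrix now is

\[ \begin{pmatrix}
K_0 R_0^t & -K_0 R_0^t C_0 & p^{(0)} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K_{m-1} R_{m-1}^t & -K_{m-1} R_{m-1}^t C_{m-1} & 0 & \cdots & p^{(m-1)} \\
l^{(m)t} K_m R_m^t & -l^{(m)t} K_m R_m^t C_m & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
l^{(n-1)t} K_{n-1} R_{n-1}^t & -l^{(n-1)t} K_{n-1} R_{n-1}^t C_{n-1} & 0 & \cdots & 0
\end{pmatrix} . \tag{92} \]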
As before, this homogeneous system of linear equations must have a non-zero
solution; or equivalently, the rank of this matrix is strictly less than the number of
columns. Matrix (92) has $3m + k$ rows and $m + 4$ columns. The rank condition
therefore only yields a constraint if the number of rows is at least the
number of columns; i.e. $3m + k \geq m + 4$, or equivalently, if $2m + k \geq 4$. In other
words, there only is a multiview constraint involving image points and lines if
either there are at least 2 points in 2 views, or 1 point and 2 lines in 3 views,
or 4 lines in 4 views. Remark that this is in agreement with respectively the
epipolar constraint for 2 views, the fundamental trifocal constraint for
3 views, and the fundamental quadrifocal constraint for 4 views.

Suppose that indeed $2m + k \geq 4$. Then the rank condition on matrix (92)
implies that all $(m+4) \times (m+4)$-submatrices of (92) must have zero determinant.
Each $(m+4) \times (m+4)$-submatrix is obtained by choosing $m+4$ rows among the
$3m + k$ rows of matrix (92); or equivalently, by deleting $(3m + k) - (m + 4) =
2m + k - 4$ rows from matrix (92). We already know that, in case of an image
point $p^{(i)}$, at most 1 row may be deleted from the submatrix (87) corresponding to
image $I^{(i)}$ in matrix (92) in order for the view $I^{(i)}$ to participate in the constraint.
On the other hand, we also know that, in case of an image line $l^{(j)}$, the image
$I^{(j)}$ is only represented in matrix (92) by the single row (91). Therefore, this row
may not be deleted in order for image $I^{(j)}$ to participate in the constraint. Put
differently, for all views to participate in the constraint, only rows corresponding
to image points may be deleted from matrix (92); and, moreover, at most 1 of the
3 rows corresponding to an image point may be removed. The minimal number
of views $n^*$ for which this condition cannot be maintained is reached when the
number of rows to be deleted from matrix (92) (i.e. $2m + k - 4$) is greater
than the number of image points involved; i.e. $2m + k - 4 > m$, or equivalently,
$m + k > 4$. As $m + k = n$, the number of views, a non-trivial constraint involving
image points and lines from all of the $n$ views $I^{(0)}, \ldots, I^{(n-1)}$ only exists
if $n \leq 4$. This proves the following theorem.

Theorem 4. (Relations between $n \geq 5$ views) [47, 48] All multiview
relations that exist between (homogeneous coordinates of) image points and lines
in five or more views of a static scene are expressed by the epipolar constraints
between any pair, the trifocal constraints between any triple, and the quadrifocal
constraints between any quadruple of views among the given images. □
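
The row-counting argument behind Theorem 4 is easy to tabulate. The sketch below is an illustration only (the function name `feature_combinations` is ours, not the author's): for a given number of views $n$ it lists the pairs $(m, k)$ of points and lines for which matrix (92) has at least as many rows as columns and for which the $2m + k - 4$ rows to be deleted can be spread over the point views with at most one deletion per view.

```python
def feature_combinations(n):
    """List (m, k) = (points, lines), m + k = n, that give an n-view
    constraint in which every view takes part."""
    result = []
    for m in range(n + 1):
        k = n - m
        rows_to_delete = (3 * m + k) - (m + 4)   # rows minus columns of matrix (92)
        # Need rows >= columns, and at most one deleted row per point view
        # (the single row of a line view may never be deleted).
        if 0 <= rows_to_delete <= m:
            result.append((m, k))
    return result

for n in range(2, 7):
    print(n, feature_combinations(n))
```

For $n = 2$ only two points qualify, for $n = 3$ the combinations are 1 point / 2 lines, 2 points / 1 line and 3 points, and from $n = 5$ onwards the list is empty, in line with the theorem.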

6.2 Have All Relations Between $n \leq 4$ Views Been Found?

Critical readers may remark that the analysis performed in section 6.1 proves the
statement in Theorem 4, but that it is not at all clear that the vanishing of the
determinants of the $(m+4) \times (m+4)$-submatrices of matrix (92) is equivalent to the
epipolar, trifocal and quadrifocal constraints derived in the previous sections. Therefore,
we will now investigate these determinants more closely and show that each
determinant corresponds to one of the relations derived earlier. But first, an
observation is made, which simplifies the calculations significantly. Recall from
linear algebra that the rank of a matrix does not change when it is multiplied
with an invertible matrix. So, the rank condition underlying the results of this
section will not change if matrix (92) is multiplied on the right by the invertible
$(m+4) \times (m+4)$-matrix

\[ \begin{pmatrix}
R_0 K_0^{-1} & C_0 & 0 & \cdots & 0 \\
0^t & 1 & 0 & \cdots & 0 \\
0^t & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0^t & 0 & 0 & \cdots & 1
\end{pmatrix} . \tag{93} \]



The resulting matrix can be simplified by using the notation $A_i =
(1 / \rho_e^{(i)}) \, K_i R_i^t R_0 K_0^{-1}$ in the first column; and, by observing that the
expression $K_i R_i^t ( C_0 - C_i ) = \rho_e^{(i)} e^{(i)}$ in the second column gives the epipole $e^{(i)}$ of
the first camera in the $(i+1)$th image $I^{(i)}$. The non-zero scalar factors $\rho_e^{(i)}$ in
the rows of the resulting matrix can then be removed by pre-multiplying it with
the inverse of the following $(3m + k) \times (3m + k)$-diagonal matrix

\[ \begin{pmatrix}
I_3 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \rho_e^{(1)} I_3 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & & \ddots & & & & \vdots \\
0 & 0 & \cdots & \rho_e^{(m-1)} I_3 & 0 & \cdots & 0 \\
0^t & 0^t & \cdots & 0^t & \rho_e^{(m)} & \cdots & 0 \\
\vdots & & & & & \ddots & \vdots \\
0^t & 0^t & \cdots & 0^t & 0 & \cdots & \rho_e^{(n-1)}
\end{pmatrix} . \tag{94} \]



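The resulting $(3m + k) \times (m + 4)$-matrix, which satisfies the same rank condition
as matrix (92), is

\[ \begin{pmatrix}
I_3 & 0 & p^{(0)} & 0 & \cdots & 0 \\
A_1 & e^{(1)} & 0 & p^{(1)} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \ddots & \vdots \\
A_{m-1} & e^{(m-1)} & 0 & 0 & \cdots & p^{(m-1)} \\
l^{(m)t} A_m & l^{(m)t} e^{(m)} & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & & \vdots \\
l^{(n-1)t} A_{n-1} & l^{(n-1)t} e^{(n-1)} & 0 & 0 & \cdots & 0
\end{pmatrix} . \tag{95} \]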

Moreover, the diagonal form of the matrix (94) guarantees that the determinants
of the $(m+4) \times (m+4)$-submatrices formed from this matrix (95) only differ
up to a non-zero scalar factor from the determinants of the corresponding
$(m+4) \times (m+4)$-submatrices of the original matrix (92). Put differently, exactly the
same constraints are found when performing the same operation either on the
original matrix (92) or on the transformed matrix (95).

Before going on, it might be useful to indicate how this algebraic transformation
of the problem relates to the geometry it describes. To this end, observe
that the first 4 columns of the original matrix (92) actually contain the camera
matrices $M_i = ( K_i R_i^t \mid -K_i R_i^t C_i )$ for $0 \leq i \leq n-1$ of the camera
set-up in the world space (possibly pre-multiplied with the transpose $l^{(i)t}$ of the
line coordinates $l^{(i)}$). In the transformed matrix (95), on the other hand,
the first 4 columns are $( I_3 \mid 0 )$ for the first 3 rows, and $( A_i \mid e^{(i)} )$ for the
others ($1 \leq i \leq n-1$), again possibly pre-multiplied with the transpose $l^{(i)t}$
of the line coordinates $l^{(i)}$. Recalling that $A_i = (1 / \rho_e^{(i)}) \, K_i R_i^t R_0 K_0^{-1}$ and that
$K_i R_i^t ( C_0 - C_i ) = \rho_e^{(i)} e^{(i)}$, it is easy to see that

\[ M_0 = ( K_0 R_0^t \mid -K_0 R_0^t C_0 ) = ( I_3 \mid 0 )
   \begin{pmatrix} K_0 R_0^t & -K_0 R_0^t C_0 \\ 0^t & 1 \end{pmatrix} \]
\[ \text{and} \quad M_i = ( K_i R_i^t \mid -K_i R_i^t C_i ) = \rho_e^{(i)} \, ( A_i \mid e^{(i)} )
   \begin{pmatrix} K_0 R_0^t & -K_0 R_0^t C_0 \\ 0^t & 1 \end{pmatrix} . \tag{96} \]

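Thus, when multiplying the original matrix (92) on the right with the
matrix (93), the original projection equations

\[ \rho_i \, p^{(i)} = ( K_i R_i^t \mid -K_i R_i^t C_i ) \begin{pmatrix} P \\ 1 \end{pmatrix} \tag{97} \]

(in extended coordinates) were replaced by the equations

\[ \rho_0 \, p^{(0)} = ( I_3 \mid 0 ) \begin{pmatrix} \tilde{P} \\ 1 \end{pmatrix}
   \quad \text{and} \quad
   \rho_i \, p^{(i)} = \rho_e^{(i)} \, ( A_i \mid e^{(i)} ) \begin{pmatrix} \tilde{P} \\ 1 \end{pmatrix} , \tag{98} \]

where the 3-vector $\tilde{P}$ is defined by

\[ \begin{pmatrix} \tilde{P} \\ 1 \end{pmatrix} =
   \begin{pmatrix} K_0 R_0^t & -K_0 R_0^t C_0 \\ 0^t & 1 \end{pmatrix}
   \begin{pmatrix} P \\ 1 \end{pmatrix} ; \tag{99} \]

or equivalently, by

\[ \tilde{P} = K_0 R_0^t \, ( P - C_0 ) . \tag{100} \]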

In other words, by multiplying the original matrix (92) on the right with the
matrix (93), a change of coordinates is performed in the world space, making
the world frame camera-centered for the first camera. As is geometrically clear,
the images only depend on the set-up of the cameras relative to the scene, but
not on the particular choice of the world frame. Consequently, the inter-image
relations should not depend on the choice of the world frame either. Making
the world frame camera-centered to the first camera will therefore not alter the
multiview relations, but it certainly makes the algebra easier, as will become
clear in a moment. Coming back to equation (98), after division by the non-zero
scalar factor $\rho_e^{(i)} \in \mathbb{R}$, the projection equations of the other cameras become

\[ \rho_i' \, p^{(i)} = ( A_i \mid e^{(i)} ) \begin{pmatrix} \tilde{P} \\ 1 \end{pmatrix} \tag{101} \]

with $\rho_i' = \rho_i / \rho_e^{(i)}$ for all $1 \leq i \leq n-1$. This division was performed
algebraically by pre-multiplication with the inverse of the diagonal matrix (94).
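
The effect of this change of world frame on the camera matrices can be checked on a small example. The following sketch uses hypothetical values throughout: `N0` and `N1` stand for $K_0 R_0^t$ and $K_1 R_1^t$, `C0` and `C1` for the camera centres, and `T` is the upper-left $4 \times 4$ block of the right-hand multiplier described above. It verifies that the first camera becomes $( I_3 \mid 0 )$ and that the fourth column of the second camera becomes a multiple $\rho_e$ of the epipole.

```python
from fractions import Fraction as F

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def camera(N, C):
    """Camera matrix (N | -N C), where N stands for K R^t and C is the centre."""
    NC = [sum(n * c for n, c in zip(row, C)) for row in N]
    return [list(row) + [-nc] for row, nc in zip(N, NC)]

# Hypothetical set-up: N0 = diag(2, 2, 1), hence R_0 K_0^{-1} = diag(1/2, 1/2, 1).
N0 = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
N0_inv = [[F(1, 2), 0, 0], [0, F(1, 2), 0], [0, 0, 1]]
C0 = [1, -1, 2]
N1 = [[1, 0, 1], [0, 2, 0], [0, 0, 1]]
C1 = [0, 1, -1]

M0, M1 = camera(N0, C0), camera(N1, C1)
# Block ( R_0 K_0^{-1}  C_0 ; 0^t  1 ) of the right-hand multiplier.
T = [N0_inv[r] + [C0[r]] for r in range(3)] + [[0, 0, 0, 1]]

M0_new = matmul(M0, T)                 # becomes (I_3 | 0)
M1_new = matmul(M1, T)                 # becomes rho_e * (A_1 | e)
rho_e = F(M1_new[2][3])
epipole = [row[3] / rho_e for row in M1_new]
A1 = [[entry / rho_e for entry in row[:3]] for row in M1_new]

# The epipole is the image of the first camera centre in the second view.
centre_image = [sum(m * c for m, c in zip(row, C0 + [1])) for row in M1]
```

With these numbers `centre_image` is $(4, -4, 3)$, so $\rho_e = 3$ and the epipole is $(4/3, -4/3, 1)$.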

Now let us verify that the vanishing of the determinants of the $(m+4) \times (m+4)$-submatrices
of matrix (95) indeed yields the multiview constraints derived in
the previous sections. Let us start with the 2-view case. Taking features from
all images, the number $m$ of points and the number $k$ of lines must sum up to
the number $n$ of views: i.e. $m + k = 2$; and, moreover, an $n$-view relation only
occurs if $2m + k \geq 4$. So, for two views, only a relation between (the image
coordinates of) 2 corresponding points exists. Using the two-view notations
$p$, $p'$, $A$ and $e'$ again, the appropriate matrix (95) is

\[ \begin{pmatrix} I_3 & 0 & p & 0 \\ A & e' & 0 & p' \end{pmatrix} . \tag{102} \]

This matrix being square, the rank condition is equivalent to the vanishing of
its determinant:

\[ \det \begin{pmatrix} I_3 & 0 & p & 0 \\ A & e' & 0 & p' \end{pmatrix} = 0 . \tag{103} \]

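Denote this determinant by $\Delta_2$. Then $\Delta_2$ is most easily calculated by using Laplace's
rule, which is a generalisation of developing a determinant along a row or column.
In particular, developing $\Delta_2$ with respect to the first three rows gives the
following non-zero terms:

\[ \begin{aligned}
\Delta_2 = (-1)^{1+2+3} \, \{ \; & (-1)^{1+2+5} \, | u_1 \; u_2 \; p | \; | a_3 \; e' \; p' | \\
 + \; & (-1)^{1+3+5} \, | u_1 \; u_3 \; p | \; | a_2 \; e' \; p' | \\
 + \; & (-1)^{2+3+5} \, | u_2 \; u_3 \; p | \; | a_1 \; e' \; p' | \; \} ,
\end{aligned} \tag{104} \]

where $a_i$ is the $i$th column of the matrix $A$, $u_1 = (1,0,0)^t$, $u_2 = (0,1,0)^t$
and $u_3 = (0,0,1)^t$ are the standard unit vectors in $\mathbb{R}^3$, and with vertical bars
denoting the determinant of the $3 \times 3$-matrix whose columns are the specified
column vectors. As $p = (x, y, 1)^t$, the determinants involving $p$ evaluate to
$| u_1 \; u_2 \; p | = 1$, $| u_1 \; u_3 \; p | = -y$ and $| u_2 \; u_3 \; p | = x$; and, since
$| a_i \; e' \; p' | = - | p' \; e' \; a_i | = - p'^t ( e' \times a_i )$, $\Delta_2$ is equal to

\[ \begin{aligned}
\Delta_2 &= -1 \, p'^t ( e' \times a_3 ) - y \, p'^t ( e' \times a_2 ) - x \, p'^t ( e' \times a_1 ) \\
         &= - p'^t \, [ \, e' \times ( x \, a_1 + y \, a_2 + 1 \, a_3 ) \, ] \\
         &= - p'^t ( e' \times A p ) = - p'^t \, [e']_\times A \, p .
\end{aligned} \tag{105} \]

Putting $F = [e']_\times A$ again, condition (103), i.e. $\Delta_2 = 0$, equals the epipolar
constraint expressed in Theorem 1.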
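
The identity between the $6 \times 6$ determinant and the epipolar expression can be confirmed with exact arithmetic. The sketch below is an illustration with hypothetical values for $A$, $e'$ and the image points; the helper names are ours. It checks that the determinant equals $-p'^t [e']_\times A p$ for arbitrary points, and that it vanishes when $p$ and $p'$ are projections of one scene point in the camera-centered frame.

```python
from fractions import Fraction as F

def det(m):
    """Determinant by Laplace expansion along the first row (exact arithmetic)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)) if m[0][j] != 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def apply(A, v):
    return [dot(row, v) for row in A]

def two_view_matrix(A, e, p, q):
    """The 6 x 6 matrix ( I_3 0 p 0 ; A e' 0 p' ), with q playing the role of p'."""
    top = [[int(r == c) for c in range(3)] + [0, p[r], 0] for r in range(3)]
    bottom = [list(A[r]) + [e[r], 0, q[r]] for r in range(3)]
    return top + bottom

def epipolar_value(A, e, p, q):
    """-p'^t [e']_x A p."""
    return -dot(q, cross(e, apply(A, p)))

A = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]   # hypothetical A
e = [1, -2, 1]                           # hypothetical epipole e'

# Arbitrary (non-corresponding) points: both sides agree but need not vanish.
p, q = [3, 4, 1], [-1, 2, 1]
generic_det = det(two_view_matrix(A, e, p, q))
generic_rhs = epipolar_value(A, e, p, q)

# Corresponding points: images of the scene point (1, 2, 5) in both views.
P = [F(1), F(2), F(5)]
p_c = [c / P[2] for c in P]
proj = [a + b for a, b in zip(apply(A, P), e)]
q_c = [c / proj[2] for c in proj]
corresponding_det = det(two_view_matrix(A, e, p_c, q_c))
```

The agreement for non-corresponding points shows that (105) is an algebraic identity; only the vanishing of the determinant encodes the correspondence.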
Next, consider the case of 3 views. Taking $m$ points and $k$ lines in 3 views
means that $m + k = 3$; and, to have a 3-view relation, $2m + k \geq 4$. So, for
three views, only the following types of relations exist: 3 points, 2 points / 1 line
and 1 point / 2 lines (cf. the table of trifocal relations given earlier). Notice that
the option "3 lines" is not found here, because the combination $m = 0$ and $k = 3$
does not satisfy the condition $2m + k \geq 4$. At first sight this may seem contradictory
to the earlier proposition and corollary on lines in three views, but it is not. Indeed,
that proposition and corollary are only valid for
corresponding lines in the images (i.e. image lines that are the projection of one
and the same 3D line in the scene). The image lines considered in this section
do not have to correspond to each other. They only need to contain the
image point that corresponds to the point(s) selected in the other images; but,
apart from that, they are completely arbitrary. This latter situation also
applies to the 3 points, 2 points / 1 line and 1 point / 2 lines constraints listed
in that table.

Let us first look at the "3 points" case. Using the three-view notations
again, the appropriate matrix (95) is

\[ \begin{pmatrix} I_3 & 0 & p & 0 & 0 \\ A & e' & 0 & p' & 0 \\ B & e'' & 0 & 0 & p'' \end{pmatrix} . \tag{106} \]

This matrix has 9 rows and 7 columns. The rank condition therefore is equivalent
to the vanishing of the determinants of all its $7 \times 7$-submatrices. In other words,
2 rows have to be deleted from this matrix to obtain a constraint; and, moreover,
to get a relation involving all 3 views, these 2 rows should be taken from different
camera matrices. As only 2 rows may be deleted, it is clear that one camera
matrix will be untouched. This explains the special role of one image when
compared to the other two views in the trifocal constraints. In the three-view
sections, the first image $I$ served that role. Deleting one row in the $A$- and one in
the $B$-line of matrix (106) and computing the determinant of the resulting
$7 \times 7$-submatrix yields one of the 9 trifocal constraints for corresponding points
derived earlier, as can be verified by explicit calculation. Here, however, we follow
an alternative strategy, namely to study the effect on the computation of the
determinant of deleting one row in the $3 \times (m+4)$-submatrix corresponding to
one camera in matrix (95).

Lemma 2. With the notations of this section: The constraint involving the image
point $p^{(i)}$ in image $I^{(i)}$ obtained by deleting the first (respectively, the second)
row of the $3 \times (m+4)$-submatrix

\[ ( \; A_i \quad e^{(i)} \quad 0 \; \ldots \; 0 \quad p^{(i)} \quad 0 \; \ldots \; 0 \; ) \tag{107} \]

of matrix (95) in order to form an $(m+4) \times (m+4)$-determinant, is equivalent
to the constraint obtained by choosing the horizontal line $h^{(i)}$ (respectively, the
vertical line $v^{(i)}$) through that point $p^{(i)}$, instead of $p^{(i)}$ itself, in image $I^{(i)}$. The
constraint obtained by deleting the third row of submatrix (107) is equivalent to
the constraint obtained by choosing the line through $p^{(i)}$ and the image origin.
Homogeneous coordinates of this line are given by the third row of the
skew-symmetric $3 \times 3$-matrix $[ p^{(i)} ]_\times$.
Proof. First of all, notice that the column vector $p^{(i)}$ in camera submatrix (107)
is part of the $(i+5)$th column of matrix (95); and, moreover, that these are the
only non-zero entries in that column. If the first row of camera submatrix (107) is
deleted to form an $(m+4) \times (m+4)$-submatrix of matrix (95), then the remaining
rows of (107) are

\[ \begin{pmatrix}
(A_i)_{2*} & y_e^{(i)} & 0 \; \ldots \; 0 & y^{(i)} & 0 \; \ldots \; 0 \\
(A_i)_{3*} & 1 & 0 \; \ldots \; 0 & 1 & 0 \; \ldots \; 0
\end{pmatrix} , \tag{108} \]

where $(A_i)_{k*}$ denotes the $k$th row of matrix $A_i$ and we also used that $p^{(i)} =
( x^{(i)}, y^{(i)}, 1 )^t$ and $e^{(i)} = ( x_e^{(i)}, y_e^{(i)}, 1 )^t$. Remember from linear algebra that the
value of a determinant does not change when a scalar multiple of one row of the
matrix is added to another row. Hence, the determinant of the $(m+4) \times (m+4)$-submatrix
will not change if one subtracts $y^{(i)}$ times the second row from the
first one in camera submatrix (108). The resulting submatrix is

\[ \begin{pmatrix}
(A_i)_{2*} - y^{(i)} (A_i)_{3*} & y_e^{(i)} - y^{(i)} & 0 \; \ldots \; 0 & 0 & 0 \; \ldots \; 0 \\
(A_i)_{3*} & 1 & 0 \; \ldots \; 0 & 1 & 0 \; \ldots \; 0
\end{pmatrix} . \tag{109} \]

Due to this operation, the 1 in the $(i+5)$th column of submatrix (109) becomes
the only non-zero entry in that column of the $(m+4) \times (m+4)$-submatrix of
matrix (95). Developing the determinant of that $(m+4) \times (m+4)$-submatrix
along this $(i+5)$th column thus results in an $(m+3) \times (m+3)$-determinant obtained
by deleting the $(i+5)$th column corresponding to the image point $p^{(i)}$ as well as
the last row of submatrix (109). Put differently, the point $p^{(i)}$ has disappeared
completely from the determinant, and the submatrix (107) corresponding to the
$(i+1)$th camera has been replaced by only one row, viz.

\[ ( \; (A_i)_{2*} - y^{(i)} (A_i)_{3*} \quad y_e^{(i)} - y^{(i)} \quad 0 \; \ldots \; 0 \quad 0 \; \ldots \; 0 \; ) . \tag{110} \]

Recall that the horizontal line $h^{(i)}$ through $p^{(i)}$ in the $(i+1)$th
image $I^{(i)}$ has homogeneous coordinates $h^{(i)} \simeq ( 0, -1, y^{(i)} )^t$. Pre-multiplying the
matrix $A_i$ with the transpose of $h^{(i)}$ gives

\[ h^{(i)t} A_i = - (A_i)_{2*} + y^{(i)} (A_i)_{3*} , \tag{111} \]

which, up to sign, is the first part of row (110). Furthermore, pre-multiplying
the epipole $e^{(i)} = ( x_e^{(i)}, y_e^{(i)}, 1 )^t$ with the transpose of $h^{(i)}$ gives

\[ h^{(i)t} e^{(i)} = - y_e^{(i)} + y^{(i)} , \tag{112} \]

which, up to sign, is the 4th entry of row (110). So, row (110), up to sign, equals

\[ ( \; h^{(i)t} A_i \quad h^{(i)t} e^{(i)} \quad 0 \; \ldots \; 0 \quad 0 \; \ldots \; 0 \; ) , \tag{113} \]

which is nothing else but the single row corresponding to the $(i+1)$th camera,
which would have been included in the original matrix (95) if the (horizontal)
line $h^{(i)}$ was chosen in image $I^{(i)}$ instead of the image point $p^{(i)}$. This proves
the claim for deleting the first row in camera submatrix (107).

The claim for deleting the second row in camera submatrix (107) follows in
exactly the same manner, but using the vertical line $v^{(i)}$ through $p^{(i)}$ in image
$I^{(i)}$, which has homogeneous coordinates $v^{(i)} \simeq ( 1, 0, -x^{(i)} )^t$. Finally, removing
the third row in camera submatrix (107) and repeating this type of reasoning
leads to replacing the point $p^{(i)}$ in image $I^{(i)}$ by the line with homogeneous
coordinates $( -y^{(i)}, x^{(i)}, 0 )^t$. This line is the line through the point $p^{(i)}$ and the
origin in image $I^{(i)}$. □
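
Lemma 2 can be tested on the "3 points" matrix (106). The sketch below is only an illustration with hypothetical integer values for $A$, $B$, the epipoles and the image points: it deletes the first row of the $A$-block and the first row of the $B$-block, and compares the resulting $7 \times 7$ determinant with the $5 \times 5$ determinant obtained when the points $p'$ and $p''$ are replaced by the horizontal lines through them. By the lemma the two agree up to sign.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (exact for integers)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)) if m[0][j] != 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def three_point_matrix(A, B, e1, e2, p, p1, p2):
    """The 9 x 7 matrix ( I_3 0 p 0 0 ; A e' 0 p' 0 ; B e'' 0 0 p'' )."""
    rows = [[int(r == c) for c in range(3)] + [0, p[r], 0, 0] for r in range(3)]
    rows += [list(A[r]) + [e1[r], 0, p1[r], 0] for r in range(3)]
    rows += [list(B[r]) + [e2[r], 0, 0, p2[r]] for r in range(3)]
    return rows

def point_two_lines_matrix(A, B, e1, e2, p, l1, l2):
    """The 5 x 5 matrix ( I_3 0 p ; l'^t A  l'^t e'  0 ; l''^t B  l''^t e''  0 )."""
    rows = [[int(r == c) for c in range(3)] + [0, p[r]] for r in range(3)]
    rows.append([dot(l1, col) for col in zip(*A)] + [dot(l1, e1), 0])
    rows.append([dot(l2, col) for col in zip(*B)] + [dot(l2, e2), 0])
    return rows

# Hypothetical data; the image points have third coordinate 1, as in the lemma.
A = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]
B = [[1, 0, 2], [1, 1, 0], [0, 2, 1]]
e1, e2 = [1, -2, 1], [3, 1, 1]
p, p1, p2 = [3, 4, 1], [2, 3, 1], [-1, 4, 1]

full = three_point_matrix(A, B, e1, e2, p, p1, p2)
# Delete the first row of the A-block (index 3) and of the B-block (index 6).
point_det = det([row for i, row in enumerate(full) if i not in (3, 6)])

h1 = [0, -1, p1[1]]   # horizontal line through p'
h2 = [0, -1, p2[1]]   # horizontal line through p''
line_det = det(point_two_lines_matrix(A, B, e1, e2, p, h1, h2))
```

Replacing `h1` by the vertical line `[1, 0, -p1[0]]` and deleting the second row of the $A$-block instead (index 4) gives the analogous check for the second claim of the lemma.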

Since the last row of the skew-symmetric $3 \times 3$-matrix

\[ [ p^{(i)} ]_\times = \begin{pmatrix}
0 & -1 & y^{(i)} \\
1 & 0 & -x^{(i)} \\
-y^{(i)} & x^{(i)} & 0
\end{pmatrix} \tag{114} \]

is a linear combination of the first two rows (it equals $-x^{(i)}$ times the first row
minus $y^{(i)}$ times the second row); and,
because the first two rows of this matrix contain homogeneous coordinates of
respectively the horizontal line $h^{(i)}$ and the vertical line $v^{(i)}$ through the point
$p^{(i)}$, it follows from Lemma 2 that the constraint obtained by deleting the last
row in a camera submatrix (107) is a linear combination of the constraints obtained
by respectively removing the first and the second row of that camera
submatrix. This proves the following proposition.
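
The two facts about $[ p^{(i)} ]_\times$ used in this argument are quickly verified. In the sketch below the image point is a hypothetical example; the code checks that each row of the skew-symmetric matrix is a line through the point, and that the third row is $-x$ times the first row minus $y$ times the second.

```python
from fractions import Fraction as F

def skew(p):
    """Skew-symmetric matrix [p]_x, defined by [p]_x q = p x q."""
    x, y, w = p
    return [[0, -w, y], [w, 0, -x], [-y, x, 0]]

p = [F(3, 2), F(-5), F(1)]           # hypothetical image point (x, y, 1)
h, v, o = skew(p)                    # horizontal, vertical, through-origin lines

# Each row is a line through p: its inner product with p vanishes.
incidences = [sum(a * b for a, b in zip(line, p)) for line in (h, v, o)]

# Third row as a combination of the first two: -x * h - y * v.
combination = [-p[0] * a - p[1] * b for a, b in zip(h, v)]
```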



Proposition 6. The linearly independent determinant constraints involving image
points $p^{(i)}$ can be obtained from the determinant constraints on image lines
by replacing the line $l^{(i)}$ in view $I^{(i)}$ by respectively the horizontal line $h^{(i)}$ and
the vertical line $v^{(i)}$ through $p^{(i)}$ in $I^{(i)}$, as was done in the previous sections. In
particular, all linearly independent $n$-view relations can be derived by this procedure
from the $n$-view relation involving the maximal number of image lines. □



In practice, since for an $n$-view relation involving $m$ points and $k$ lines one
needs $m + k = n$ and $2m + k \geq 4$, it follows from Proposition 6 that for 3 views
the relation involving the maximal number of lines is the one with 1 point and
2 lines; whereas that "fundamental" relation for 4 views is the one involving
4 lines only. To prove the claim made at the beginning of this section (viz.
that all the constraints given by the $(m+4) \times (m+4)$-determinants described
in section 6.1 are equivalent to the multiview relations derived in the previous
sections) it suffices to show that the "1 point / 2 lines" relation in the 3-view
case corresponds to the fundamental trifocal constraint of Theorem 2
and that the "4 lines" relation in the case of 4 views is just the fundamental
quadrifocal constraint of Theorem 3.
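First consider the "1 point / 2 lines" situation for 3 views. The appropriate
matrix (95) for this case is

\[ \begin{pmatrix} I_3 & 0 & p \\ l'^t A & l'^t e' & 0 \\ l''^t B & l''^t e'' & 0 \end{pmatrix} , \tag{115} \]

where the three-view notations are used again. This matrix being square, the
rank condition on this matrix is equivalent to the vanishing of its determinant:

\[ \det \begin{pmatrix} I_3 & 0 & p \\ l'^t A & l'^t e' & 0 \\ l''^t B & l''^t e'' & 0 \end{pmatrix} = 0 . \tag{116} \]

Denote this determinant by $\Delta_3$. Then $\Delta_3$ is most easily calculated by using Laplace's
rule again. In particular, developing $\Delta_3$ with respect to the first three rows gives
the following non-zero terms:

\[ \begin{aligned}
\Delta_3 = (-1)^{1+2+3} \, \Big\{ \; & (-1)^{1+2+5} \, | u_1 \; u_2 \; p | \,
   \begin{vmatrix} l'^t a_3 & l'^t e' \\ l''^t b_3 & l''^t e'' \end{vmatrix} \\
 + \; & (-1)^{1+3+5} \, | u_1 \; u_3 \; p | \,
   \begin{vmatrix} l'^t a_2 & l'^t e' \\ l''^t b_2 & l''^t e'' \end{vmatrix} \\
 + \; & (-1)^{2+3+5} \, | u_2 \; u_3 \; p | \,
   \begin{vmatrix} l'^t a_1 & l'^t e' \\ l''^t b_1 & l''^t e'' \end{vmatrix} \; \Big\} ,
\end{aligned} \tag{117} \]

where $a_i$ and $b_i$ are the $i$th column of respectively the matrix $A$ and $B$, and with
$u_1$, $u_2$ and $u_3$ the standard unit 3-vectors, as before. Using that $l''^t e'' = e''^t l''$
and $l''^t b_i = b_i^t l''$ for all $i$, the $2 \times 2$-determinants in the previous equality can be
written as

\[ \begin{vmatrix} l'^t a_i & l'^t e' \\ l''^t b_i & l''^t e'' \end{vmatrix}
   = ( l'^t a_i ) ( e''^t l'' ) - ( l'^t e' ) ( b_i^t l'' )
   = l'^t \, [ \, a_i \, e''^t - e' \, b_i^t \, ] \, l'' . \tag{118} \]

As $p = (x, y, 1)^t$, determinant (117) is equal to

\[ \begin{aligned}
\Delta_3 &= 1 \, l'^t [ \, a_3 e''^t - e' b_3^t \, ] \, l'' + y \, l'^t [ \, a_2 e''^t - e' b_2^t \, ] \, l''
            + x \, l'^t [ \, a_1 e''^t - e' b_1^t \, ] \, l'' \\
         &= l'^t \, [ \, ( x \, a_1 + y \, a_2 + 1 \, a_3 ) \, e''^t - e' \, ( x \, b_1 + y \, b_2 + 1 \, b_3 )^t \, ] \, l'' \\
         &= l'^t \, [ \, ( A p ) \, e''^t - e' \, ( B p )^t \, ] \, l'' .
\end{aligned} \tag{119} \]

Putting $[Tp] = ( A p ) \, e''^t - e' \, ( B p )^t$ as before, the condition $\Delta_3 = 0$
equals the fundamental trifocal constraint expressed in Theorem 2. This
proves the claim for the 3-view case.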
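
As for the two-view case, the determinant identity (119) can be confirmed with exact arithmetic. The sketch below uses hypothetical values for $A$, $B$, the epipoles, the point and the lines; the helper names are ours. It checks that the $5 \times 5$ determinant equals $l'^t [ (Ap) e''^t - e' (Bp)^t ] l''$ for arbitrary data, and that it vanishes when the two lines pass through the images of the scene point that projects onto $p$.

```python
from fractions import Fraction as F

def det(m):
    """Determinant by Laplace expansion along the first row (exact arithmetic)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)) if m[0][j] != 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def point_two_lines_matrix(A, B, e1, e2, p, l1, l2):
    """The 5 x 5 matrix ( I_3 0 p ; l'^t A  l'^t e'  0 ; l''^t B  l''^t e''  0 )."""
    rows = [[int(r == c) for c in range(3)] + [0, p[r]] for r in range(3)]
    rows.append([dot(l1, col) for col in zip(*A)] + [dot(l1, e1), 0])
    rows.append([dot(l2, col) for col in zip(*B)] + [dot(l2, e2), 0])
    return rows

def trifocal_value(A, B, e1, e2, p, l1, l2):
    """l'^t [ (A p) e''^t - e' (B p)^t ] l''."""
    Ap = [dot(row, p) for row in A]
    Bp = [dot(row, p) for row in B]
    return dot(l1, Ap) * dot(e2, l2) - dot(l1, e1) * dot(Bp, l2)

A = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]   # hypothetical A, B, e', e''
B = [[1, 0, 2], [1, 1, 0], [0, 2, 1]]
e1, e2 = [1, -2, 1], [3, 1, 1]

# Arbitrary point and lines: identity (119) holds without vanishing.
p, l1, l2 = [3, 4, 1], [1, 2, -3], [2, -1, 1]
generic_det = det(point_two_lines_matrix(A, B, e1, e2, p, l1, l2))
generic_rhs = trifocal_value(A, B, e1, e2, p, l1, l2)

# Lines through the images of the scene point (1, 2, 5): the constraint holds.
P = [F(1), F(2), F(5)]
p_c = [c / P[2] for c in P]
q1 = [dot(row, P) + e for row, e in zip(A, e1)]      # rho' * p'
q2 = [dot(row, P) + e for row, e in zip(B, e2)]      # rho'' * p''
h1 = [0, -1, q1[1] / q1[2]]                          # horizontal line through p'
h2 = [0, -1, q2[1] / q2[2]]                          # horizontal line through p''
incident_det = det(point_two_lines_matrix(A, B, e1, e2, p_c, h1, h2))
```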
In case of 4 views, on the other hand, the fundamental quadrifocal constraint
in Theorem 3 was proven precisely as explained in this section. So,
nothing more has to be said about this case. Having shown that all the relations
between multiple views of a static scene are covered by the constraints derived in
the previous sections, we also conclude our guided tour through multiview relations.